Thread: stopgap fix for signal handling during restore_command

stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Wed, Feb 22, 2023 at 09:48:10PM +1300, Thomas Munro wrote:
> On Tue, Feb 21, 2023 at 5:50 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I'm happy to create a new thread if needed, but I can't tell if there is
>> any interest in this stopgap/back-branch fix.  Perhaps we should just jump
>> straight to the long-term fix that Thomas is looking into.
> 
> Unfortunately the latch-friendly subprocess module proposal I was
> talking about would be for 17.  I may post a thread fairly soon with
> design ideas + list of problems and decision points as I see them, and
> hopefully some sketch code, but it won't be a proposal for [/me checks
> calendar] next week's commitfest and probably wouldn't be appropriate
> in a final commitfest anyway, and I also have some other existing
> stuff to clear first.  So please do continue with the stopgap ideas.

Okay, here is a new thread...

Since v8.4, the startup process will proc_exit() immediately within its
SIGTERM handler while the restore_command executes via system().   Some
recent changes added unsafe code to the section where this behavior is
enabled [0].  The long-term fix likely includes moving away from system()
completely, but we may want to have a stopgap/back-branch fix while that is
under development.

I've attached a patch set for a proposed stopgap fix.  0001 simply moves
the extra code outside of the Pre/PostRestoreCommand() block so that only
system() is executed while the SIGTERM handler might proc_exit().  This
restores the behavior that was in place from v8.4 to v14, so I don't expect
it to be too controversial.  0002 adds code to startup's SIGTERM handler to
call _exit() instead of proc_exit() if we are in a forked process from
system(), etc.  It also adds assertions to ensure proc_exit(), ProcKill(),
and AuxiliaryProcKill() are not called within such forked processes.

Thoughts?

[0] https://postgr.es/m/20230201105514.rsjl4bnhb65giyvo%40alap3.anarazel.de

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: stopgap fix for signal handling during restore_command

From
Thomas Munro
Date:
On Fri, Feb 24, 2023 at 12:15 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> Thoughts?

I think you should have a trailing \n when writing to stderr.

Here's that reproducer I speculated about (sorry I confused SIGQUIT
and SIGTERM in my earlier email, ENOCOFFEE).  Seems to do the job, and
I tested on a Linux box for good measure.  If you comment out the
kill(), "check PROVE_TESTS=t/002_archiving.pl" works fine
(demonstrating that that definition of system() works fine).  With the
kill(), it reliably reaches 'TRAP: failed Assert("latch->owner_pid ==
MyProcPid")' without your patch, and with your patch it avoids it.  (I
believe glibc's system() could reach it too with the right timing, but
I didn't try, my point being that the use of the OpenBSD system() here
is only  because it's easier to grok and to wrangle.)

Attachment

Re: stopgap fix for signal handling during restore_command

From
Thomas Munro
Date:
On Fri, Feb 24, 2023 at 1:25 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ENOCOFFEE

Erm, I realised after sending that I'd accidentally sent a version
that uses fork() anyway, and now if I change it back to vfork() it
doesn't fail the way I wanted to demonstrate, at least on Linux.  I
don't have time or desire to dig into how Linux vfork() really works
so I'll leave it at that... but the patch as posted does seem to be a
useful tool for understanding this failure... please just ignore the
confused comments about fork() vs vfork() therein.



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Fri, Feb 24, 2023 at 01:25:01PM +1300, Thomas Munro wrote:
> I think you should have a trailing \n when writing to stderr.

Oops.  I added that in v7.

> Here's that reproducer I speculated about (sorry I confused SIGQUIT
> and SIGTERM in my earlier email, ENOCOFFEE).  Seems to do the job, and
> I tested on a Linux box for good measure.  If you comment out the
> kill(), "check PROVE_TESTS=t/002_archiving.pl" works fine
> (demonstrating that that definition of system() works fine).  With the
> kill(), it reliably reaches 'TRAP: failed Assert("latch->owner_pid ==
> MyProcPid")' without your patch, and with your patch it avoids it.  (I
> believe glibc's system() could reach it too with the right timing, but
> I didn't try, my point being that the use of the OpenBSD system() here
> is only  because it's easier to grok and to wrangle.)

Thanks for providing the reproducer!  I am seeing the behavior that you
described on my Linux machine.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: stopgap fix for signal handling during restore_command

From
Andres Freund
Date:
Hi,

On 2023-02-23 20:33:23 -0800, Nathan Bossart wrote:>  
>      if (in_restore_command)
> -        proc_exit(1);
> +    {
> +        /*
> +         * If we are in a child process (e.g., forked by system() in
> +         * RestoreArchivedFile()), we don't want to call any exit callbacks.
> +         * The parent will take care of that.
> +         */
> +        if (MyProcPid == (int) getpid())
> +            proc_exit(1);
> +        else
> +        {
> +            const char    msg[] = "StartupProcShutdownHandler() called in child process\n";
> +            int            rc pg_attribute_unused();
> +
> +            rc = write(STDERR_FILENO, msg, sizeof(msg));
> +            _exit(1);
> +        }
> +    }

Why do we need that rc variable? Don't we normally get away with (void)
write(...)?


> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
> index 22b4278610..e3da0622d7 100644
> --- a/src/backend/storage/lmgr/proc.c
> +++ b/src/backend/storage/lmgr/proc.c
> @@ -805,6 +805,7 @@ ProcKill(int code, Datum arg)
>      dlist_head *procgloballist;
>  
>      Assert(MyProc != NULL);
> +    Assert(MyProcPid == (int) getpid());  /* not safe if forked by system(), etc. */
>  
>      /* Make sure we're out of the sync rep lists */
>      SyncRepCleanupAtProcExit();
> @@ -925,6 +926,7 @@ AuxiliaryProcKill(int code, Datum arg)
>      PGPROC       *proc;
>  
>      Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS);
> +    Assert(MyProcPid == (int) getpid());  /* not safe if forked by system(), etc. */
>  
>      auxproc = &AuxiliaryProcs[proctype];
>  
> -- 
> 2.25.1

I think the much more interesting assertion here would be to check that
MyProc->pid equals the current pid.

Greetings,

Andres Freund



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Sat, Feb 25, 2023 at 11:07:42AM -0800, Andres Freund wrote:
> On 2023-02-23 20:33:23 -0800, Nathan Bossart wrote:>  
>>      if (in_restore_command)
>> -        proc_exit(1);
>> +    {
>> +        /*
>> +         * If we are in a child process (e.g., forked by system() in
>> +         * RestoreArchivedFile()), we don't want to call any exit callbacks.
>> +         * The parent will take care of that.
>> +         */
>> +        if (MyProcPid == (int) getpid())
>> +            proc_exit(1);
>> +        else
>> +        {
>> +            const char    msg[] = "StartupProcShutdownHandler() called in child process\n";
>> +            int            rc pg_attribute_unused();
>> +
>> +            rc = write(STDERR_FILENO, msg, sizeof(msg));
>> +            _exit(1);
>> +        }
>> +    }
> 
> Why do we need that rc variable? Don't we normally get away with (void)
> write(...)?

My compiler complains about that.  :/

    ../postgresql/src/backend/postmaster/startup.c: In function ‘StartupProcShutdownHandler’:
    ../postgresql/src/backend/postmaster/startup.c:139:11: error: ignoring return value of ‘write’, declared with
attributewarn_unused_result [-Werror=unused-result]
 
      139 |    (void) write(STDERR_FILENO, msg, sizeof(msg));
          |           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    cc1: all warnings being treated as errors

>> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
>> index 22b4278610..e3da0622d7 100644
>> --- a/src/backend/storage/lmgr/proc.c
>> +++ b/src/backend/storage/lmgr/proc.c
>> @@ -805,6 +805,7 @@ ProcKill(int code, Datum arg)
>>      dlist_head *procgloballist;
>>  
>>      Assert(MyProc != NULL);
>> +    Assert(MyProcPid == (int) getpid());  /* not safe if forked by system(), etc. */
>>  
>>      /* Make sure we're out of the sync rep lists */
>>      SyncRepCleanupAtProcExit();
>> @@ -925,6 +926,7 @@ AuxiliaryProcKill(int code, Datum arg)
>>      PGPROC       *proc;
>>  
>>      Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS);
>> +    Assert(MyProcPid == (int) getpid());  /* not safe if forked by system(), etc. */
>>  
>>      auxproc = &AuxiliaryProcs[proctype];
>>  
>> -- 
>> 2.25.1
> 
> I think the much more interesting assertion here would be to check that
> MyProc->pid equals the current pid.

I don't mind changing this, but why is this a more interesting assertion?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Sat, Feb 25, 2023 at 11:28:25AM -0800, Nathan Bossart wrote:
> On Sat, Feb 25, 2023 at 11:07:42AM -0800, Andres Freund wrote:
>> I think the much more interesting assertion here would be to check that
>> MyProc->pid equals the current pid.
> 
> I don't mind changing this, but why is this a more interesting assertion?

Here is a new patch set with this change.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: stopgap fix for signal handling during restore_command

From
Andres Freund
Date:
Hi,

On 2023-02-25 11:28:25 -0800, Nathan Bossart wrote:
> On Sat, Feb 25, 2023 at 11:07:42AM -0800, Andres Freund wrote:
> > Why do we need that rc variable? Don't we normally get away with (void)
> > write(...)?
> 
> My compiler complains about that.  :/
> 
>     ../postgresql/src/backend/postmaster/startup.c: In function ‘StartupProcShutdownHandler’:
>     ../postgresql/src/backend/postmaster/startup.c:139:11: error: ignoring return value of ‘write’, declared with
attributewarn_unused_result [-Werror=unused-result]
 
>       139 |    (void) write(STDERR_FILENO, msg, sizeof(msg));
>           |           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>     cc1: all warnings being treated as errors

Ick.  I guess we've already encountered this, because we've apparently removed
all the (void) write cases. Which I am certain we had at some point. We still
do it for a bunch of other functions though.  Ah, yes: aa90e148ca7,
27314d32a88, 6c72a28e5ce etc.

I think I opined on this before, but we really ought to have a function to do
some minimal signal safe output. Implemented centrally, instead of being open
coded in a bunch of places.


> >> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
> >> index 22b4278610..e3da0622d7 100644
> >> --- a/src/backend/storage/lmgr/proc.c
> >> +++ b/src/backend/storage/lmgr/proc.c
> >> @@ -805,6 +805,7 @@ ProcKill(int code, Datum arg)
> >>      dlist_head *procgloballist;
> >>  
> >>      Assert(MyProc != NULL);
> >> +    Assert(MyProcPid == (int) getpid());  /* not safe if forked by system(), etc. */
> >>  
> >>      /* Make sure we're out of the sync rep lists */
> >>      SyncRepCleanupAtProcExit();
> >> @@ -925,6 +926,7 @@ AuxiliaryProcKill(int code, Datum arg)
> >>      PGPROC       *proc;
> >>  
> >>      Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS);
> >> +    Assert(MyProcPid == (int) getpid());  /* not safe if forked by system(), etc. */
> >>  
> >>      auxproc = &AuxiliaryProcs[proctype];
> >>  
> >> -- 
> >> 2.25.1
> > 
> > I think the much more interesting assertion here would be to check that
> > MyProc->pid equals the current pid.
> 
> I don't mind changing this, but why is this a more interesting assertion?

Because we so far have little to no protection against some state corruption
leading to releasing PGPROC that's not ours.  I didn't actually mean that we
shouldn't check that MyProcPid == (int) getpid(), just that the much more
interesting case to check is that MyProc->pid matches, because that protect
against multiple releases, releasing the wrong slot, etc.

Greetings,

Andres Freund



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Sat, Feb 25, 2023 at 11:52:53AM -0800, Andres Freund wrote:
> I think I opined on this before, but we really ought to have a function to do
> some minimal signal safe output. Implemented centrally, instead of being open
> coded in a bunch of places.

While looking around for the right place to put this, I noticed that
there's a write_stderr() function in elog.c that we might be able to use.
I used that in v9.  WDYT?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: stopgap fix for signal handling during restore_command

From
Andres Freund
Date:
Hi,

On 2023-02-25 14:06:29 -0800, Nathan Bossart wrote:
> On Sat, Feb 25, 2023 at 11:52:53AM -0800, Andres Freund wrote:
> > I think I opined on this before, but we really ought to have a function to do
> > some minimal signal safe output. Implemented centrally, instead of being open
> > coded in a bunch of places.
> 
> While looking around for the right place to put this, I noticed that
> there's a write_stderr() function in elog.c that we might be able to use.
> I used that in v9.  WDYT?

write_stderr() isn't signal safe, from what I can tell.



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Sun, Feb 26, 2023 at 10:00:29AM -0800, Andres Freund wrote:
> On 2023-02-25 14:06:29 -0800, Nathan Bossart wrote:
>> On Sat, Feb 25, 2023 at 11:52:53AM -0800, Andres Freund wrote:
>> > I think I opined on this before, but we really ought to have a function to do
>> > some minimal signal safe output. Implemented centrally, instead of being open
>> > coded in a bunch of places.
>> 
>> While looking around for the right place to put this, I noticed that
>> there's a write_stderr() function in elog.c that we might be able to use.
>> I used that in v9.  WDYT?
> 
> write_stderr() isn't signal safe, from what I can tell.

*facepalm*  Sorry.

What precisely did you have in mind?  AFAICT you are asking for a wrapper
around write().

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: stopgap fix for signal handling during restore_command

From
Andres Freund
Date:
Hi,

On 2023-02-26 11:39:00 -0800, Nathan Bossart wrote:
> On Sun, Feb 26, 2023 at 10:00:29AM -0800, Andres Freund wrote:
> > On 2023-02-25 14:06:29 -0800, Nathan Bossart wrote:
> >> On Sat, Feb 25, 2023 at 11:52:53AM -0800, Andres Freund wrote:
> >> > I think I opined on this before, but we really ought to have a function to do
> >> > some minimal signal safe output. Implemented centrally, instead of being open
> >> > coded in a bunch of places.
> >> 
> >> While looking around for the right place to put this, I noticed that
> >> there's a write_stderr() function in elog.c that we might be able to use.
> >> I used that in v9.  WDYT?
> > 
> > write_stderr() isn't signal safe, from what I can tell.
> 
> *facepalm*  Sorry.
> 
> What precisely did you have in mind?  AFAICT you are asking for a wrapper
> around write().

Partially I just want something that can easily be searched for, that can have
comments attached to it documenting why what it is doing is safe.

It'd not be a huge amount of work to have a slow and restricted string
interpolation support, to make it easier to write messages. Converting floats
is probably too hard to do safely, and I'm not sure %m can safely be
supported. But basic things like %d would be pretty simple.

Basically a loop around the format string that directly writes to stderr using
write(), and only supports a signal safe subset of normal format strings.

Greetings,

Andres Freund



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Sun, Feb 26, 2023 at 12:12:27PM -0800, Andres Freund wrote:
> On 2023-02-26 11:39:00 -0800, Nathan Bossart wrote:
>> What precisely did you have in mind?  AFAICT you are asking for a wrapper
>> around write().
> 
> Partially I just want something that can easily be searched for, that can have
> comments attached to it documenting why what it is doing is safe.
> 
> It'd not be a huge amount of work to have a slow and restricted string
> interpolation support, to make it easier to write messages. Converting floats
> is probably too hard to do safely, and I'm not sure %m can safely be
> supported. But basic things like %d would be pretty simple.
> 
> Basically a loop around the format string that directly writes to stderr using
> write(), and only supports a signal safe subset of normal format strings.

Got it, thanks.  I will try to put something together along these lines,
although I don't know if I'll pick up the interpolation support in this
thread.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Tue, Feb 28, 2023 at 08:36:03PM -0800, Nathan Bossart wrote:
> On Sun, Feb 26, 2023 at 12:12:27PM -0800, Andres Freund wrote:
>> Partially I just want something that can easily be searched for, that can have
>> comments attached to it documenting why what it is doing is safe.
>> 
>> It'd not be a huge amount of work to have a slow and restricted string
>> interpolation support, to make it easier to write messages. Converting floats
>> is probably too hard to do safely, and I'm not sure %m can safely be
>> supported. But basic things like %d would be pretty simple.
>> 
>> Basically a loop around the format string that directly writes to stderr using
>> write(), and only supports a signal safe subset of normal format strings.
> 
> Got it, thanks.  I will try to put something together along these lines,
> although I don't know if I'll pick up the interpolation support in this
> thread.

Here is an attempt at adding a signal safe function for writing to STDERR.

I didn't add support for format strings, but looking ahead, I think one
challenge will be avoiding va_start() and friends.  In any case, IMO format
string support probably deserves its own thread.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: stopgap fix for signal handling during restore_command

From
Andres Freund
Date:
Hi,

On 2023-03-01 14:47:51 -0800, Nathan Bossart wrote:
> On Tue, Feb 28, 2023 at 08:36:03PM -0800, Nathan Bossart wrote:
> > On Sun, Feb 26, 2023 at 12:12:27PM -0800, Andres Freund wrote:
> >> Partially I just want something that can easily be searched for, that can have
> >> comments attached to it documenting why what it is doing is safe.
> >> 
> >> It'd not be a huge amount of work to have a slow and restricted string
> >> interpolation support, to make it easier to write messages. Converting floats
> >> is probably too hard to do safely, and I'm not sure %m can safely be
> >> supported. But basic things like %d would be pretty simple.
> >> 
> >> Basically a loop around the format string that directly writes to stderr using
> >> write(), and only supports a signal safe subset of normal format strings.
> > 
> > Got it, thanks.  I will try to put something together along these lines,
> > although I don't know if I'll pick up the interpolation support in this
> > thread.
> 
> Here is an attempt at adding a signal safe function for writing to STDERR.

Cool.

> I didn't add support for format strings, but looking ahead, I think one
> challenge will be avoiding va_start() and friends.  In any case, IMO format
> string support probably deserves its own thread.

Makes sense to split that off.

FWIW, I think we could rely on va_start() et al to be signal safe. The
standardese isn't super clear about this, because they aren't functions, and
posix only talks about functions being async signal safe...

Greetings,

Andres Freund



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Wed, Mar 01, 2023 at 03:13:04PM -0800, Andres Freund wrote:
> FWIW, I think we could rely on va_start() et al to be signal safe. The
> standardese isn't super clear about this, because they aren't functions, and
> posix only talks about functions being async signal safe...

Good to know.  I couldn't tell whether that was a safe assumption from
briefly reading around.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Wed, Mar 01, 2023 at 03:13:04PM -0800, Andres Freund wrote:
> On 2023-03-01 14:47:51 -0800, Nathan Bossart wrote:
>> Here is an attempt at adding a signal safe function for writing to STDERR.
> 
> Cool.

I'm gently bumping this thread to see if anyone had additional feedback on
the proposed patches [0].  The intent was to back-patch these as needed and
to pursue a long-term fix in v17.  Are there any concerns with that?

[0] https://postgr.es/m/20230301224751.GA1823946%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: stopgap fix for signal handling during restore_command

From
Peter Eisentraut
Date:
On 22.04.23 01:07, Nathan Bossart wrote:
> On Wed, Mar 01, 2023 at 03:13:04PM -0800, Andres Freund wrote:
>> On 2023-03-01 14:47:51 -0800, Nathan Bossart wrote:
>>> Here is an attempt at adding a signal safe function for writing to STDERR.
>>
>> Cool.
> 
> I'm gently bumping this thread to see if anyone had additional feedback on
> the proposed patches [0].  The intent was to back-patch these as needed and
> to pursue a long-term fix in v17.  Are there any concerns with that?
> 
> [0] https://postgr.es/m/20230301224751.GA1823946%40nathanxps13

Is this still being contemplated?  What is the status of this?



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Sun, Oct 01, 2023 at 08:50:15PM +0200, Peter Eisentraut wrote:
> Is this still being contemplated?  What is the status of this?

I'll plan on committing this in the next couple of days.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: stopgap fix for signal handling during restore_command

From
Andres Freund
Date:
On 2023-03-01 14:47:51 -0800, Nathan Bossart wrote:
> Subject: [PATCH v10 1/2] Move extra code out of the Pre/PostRestoreCommand()
>  block.

LGTM


> From fb6957da01f11b75d1a1966f32b00e2e77c789a0 Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <nathandbossart@gmail.com>
> Date: Tue, 14 Feb 2023 09:44:53 -0800
> Subject: [PATCH v10 2/2] Don't proc_exit() in startup's SIGTERM handler if
>  forked by system().
> 
> Instead, emit a message to STDERR and _exit() in this case.  This
> change also adds assertions to proc_exit(), ProcKill(), and
> AuxiliaryProcKill() to verify that these functions are not called
> by a process forked by system(), etc.
> ---
>  src/backend/postmaster/startup.c | 17 ++++++++++++++++-
>  src/backend/storage/ipc/ipc.c    |  3 +++
>  src/backend/storage/lmgr/proc.c  |  2 ++
>  src/backend/utils/error/elog.c   | 28 ++++++++++++++++++++++++++++
>  src/include/utils/elog.h         |  6 +-----
>  5 files changed, 50 insertions(+), 6 deletions(-)



> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
> index 22b4278610..b9e2c3aafe 100644
> --- a/src/backend/storage/lmgr/proc.c
> +++ b/src/backend/storage/lmgr/proc.c
> @@ -805,6 +805,7 @@ ProcKill(int code, Datum arg)
>      dlist_head *procgloballist;
>  
>      Assert(MyProc != NULL);
> +    Assert(MyProc->pid == (int) getpid());  /* not safe if forked by system(), etc. */
>  
>      /* Make sure we're out of the sync rep lists */
>      SyncRepCleanupAtProcExit();
> @@ -925,6 +926,7 @@ AuxiliaryProcKill(int code, Datum arg)
>      PGPROC       *proc;
>  
>      Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS);
> +    Assert(MyProc->pid == (int) getpid());  /* not safe if forked by system(), etc. */
>  
>      auxproc = &AuxiliaryProcs[proctype];
>  

I'd make these elog(PANIC), I think. The paths are not performance critical
enough that a single branch hurts, so the overhead of the check is irrelevant,
and the consequences of calling ProcKill() twice for the same process are very
severe.


> +/*
> + * Write a message to STDERR using only async-signal-safe functions.  This can
> + * be used to safely emit a message from a signal handler.
> + *
> + * TODO: It is likely possible to safely do a limited amount of string
> + * interpolation (e.g., %s and %d), but that is not presently supported.
> + */
> +void
> +write_stderr_signal_safe(const char *fmt)

As is, this isn't a format, so I'd probably just name it s or str :)


> -/*
> - * Write errors to stderr (or by equal means when stderr is
> - * not available). Used before ereport/elog can be used
> - * safely (memory context, GUC load etc)
> - */
>  extern void write_stderr(const char *fmt,...) pg_attribute_printf(1, 2);
> +extern void write_stderr_signal_safe(const char *fmt);

Not sure why you removed the comment?

Greetings,

Andres Freund



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Tue, Oct 10, 2023 at 04:40:28PM -0700, Andres Freund wrote:
> On 2023-03-01 14:47:51 -0800, Nathan Bossart wrote:
>> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
>> index 22b4278610..b9e2c3aafe 100644
>> --- a/src/backend/storage/lmgr/proc.c
>> +++ b/src/backend/storage/lmgr/proc.c
>> @@ -805,6 +805,7 @@ ProcKill(int code, Datum arg)
>>      dlist_head *procgloballist;
>>  
>>      Assert(MyProc != NULL);
>> +    Assert(MyProc->pid == (int) getpid());  /* not safe if forked by system(), etc. */
>>  
>>      /* Make sure we're out of the sync rep lists */
>>      SyncRepCleanupAtProcExit();
>> @@ -925,6 +926,7 @@ AuxiliaryProcKill(int code, Datum arg)
>>      PGPROC       *proc;
>>  
>>      Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS);
>> +    Assert(MyProc->pid == (int) getpid());  /* not safe if forked by system(), etc. */
>>  
>>      auxproc = &AuxiliaryProcs[proctype];
>>  
> 
> I'd make these elog(PANIC), I think. The paths are not performance critical
> enough that a single branch hurts, so the overhead of the check is irrelevant,
> and the consequences of calling ProcKill() twice for the same process are very
> severe.

Right.  Should we write_stderr_signal_safe() and then abort() to keep these
paths async-signal-safe?

>> +/*
>> + * Write a message to STDERR using only async-signal-safe functions.  This can
>> + * be used to safely emit a message from a signal handler.
>> + *
>> + * TODO: It is likely possible to safely do a limited amount of string
>> + * interpolation (e.g., %s and %d), but that is not presently supported.
>> + */
>> +void
>> +write_stderr_signal_safe(const char *fmt)
> 
> As is, this isn't a format, so I'd probably just name it s or str :)

Yup.

>> -/*
>> - * Write errors to stderr (or by equal means when stderr is
>> - * not available). Used before ereport/elog can be used
>> - * safely (memory context, GUC load etc)
>> - */
>>  extern void write_stderr(const char *fmt,...) pg_attribute_printf(1, 2);
>> +extern void write_stderr_signal_safe(const char *fmt);
> 
> Not sure why you removed the comment?

I think it was because it's an exact copy of the comment above the function
in elog.c, and I didn't want to give the impression that it applied to the
signal-safe one, too.  I added it back along with a new comment for
write_stderr_signal_safe().

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Tue, Oct 10, 2023 at 09:54:18PM -0500, Nathan Bossart wrote:
> On Tue, Oct 10, 2023 at 04:40:28PM -0700, Andres Freund wrote:
>> I'd make these elog(PANIC), I think. The paths are not performance critical
>> enough that a single branch hurts, so the overhead of the check is irrelevant,
>> and the consequences of calling ProcKill() twice for the same process are very
>> severe.
> 
> Right.  Should we write_stderr_signal_safe() and then abort() to keep these
> paths async-signal-safe?

Hm.  I see that elog() is called elsewhere in proc_exit(), and it does not
appear to be async-signal-safe.  Am I missing something?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: stopgap fix for signal handling during restore_command

From
Andres Freund
Date:
Hi,

On 2023-10-10 22:29:34 -0500, Nathan Bossart wrote:
> On Tue, Oct 10, 2023 at 09:54:18PM -0500, Nathan Bossart wrote:
> > On Tue, Oct 10, 2023 at 04:40:28PM -0700, Andres Freund wrote:
> >> I'd make these elog(PANIC), I think. The paths are not performance critical
> >> enough that a single branch hurts, so the overhead of the check is irrelevant,
> >> and the consequences of calling ProcKill() twice for the same process are very
> >> severe.
> > 
> > Right.  Should we write_stderr_signal_safe() and then abort() to keep these
> > paths async-signal-safe?
> 
> Hm.  I see that elog() is called elsewhere in proc_exit(), and it does not
> appear to be async-signal-safe.  Am I missing something?

We shouldn't call proc_exit() in a signal handler. We perhaps have a few
remaining calls left, but we should (and I think in some cases are) working on
removing those.

Greetings,

Andres Freund



Re: stopgap fix for signal handling during restore_command

From
Michael Paquier
Date:
On Tue, Oct 10, 2023 at 08:39:29PM -0700, Andres Freund wrote:
> We shouldn't call proc_exit() in a signal handler. We perhaps have a few
> remaining calls left, but we should (and I think in some cases are) working on
> removing those.

Hmm.  I don't recall anything remaining, even after a quick check.
FWIW, I was under the impression that Thomas' work done in
0da096d78e1e4 has cleaned up the last bits of that.
--
Michael

Attachment

Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Wed, Oct 11, 2023 at 01:02:14PM +0900, Michael Paquier wrote:
> On Tue, Oct 10, 2023 at 08:39:29PM -0700, Andres Freund wrote:
>> We shouldn't call proc_exit() in a signal handler. We perhaps have a few
>> remaining calls left, but we should (and I think in some cases are) working on
>> removing those.

Got it.

> Hmm.  I don't recall anything remaining, even after a quick check.
> FWIW, I was under the impression that Thomas' work done in
> 0da096d78e1e4 has cleaned up the last bits of that.

StartupProcShutdownHandler() remains, at least.  Of the other items in
Tom's list from 2020 [0], bgworker_die() and FloatExceptionHandler() are
also still unsafe.  RecoveryConflictInterrupt() should be fixed by 0da096d,
and StandbyDeadLockHandler() and StandbyTimeoutHandler() should be fixed by
8900b5a and 8f1537d, respectively.

[0] https://postgr.es/m/148145.1599703626%40sss.pgh.pa.us

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
Committed and back-patched.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Tue, Oct 17, 2023 at 10:46:47AM -0500, Nathan Bossart wrote:
> Committed and back-patched.

... and it looks like some of the back-branches are failing for Windows.
I'm assuming this is because c290e79 was only back-patched to v15.  My
first instinct is just to back-patch that one all the way to v11, but maybe
there's an alternative involving #ifdef WIN32.  Are there any concerns with
back-patching c290e79?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: stopgap fix for signal handling during restore_command

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> ... and it looks like some of the back-branches are failing for Windows.
> I'm assuming this is because c290e79 was only back-patched to v15.  My
> first instinct is just to back-patch that one all the way to v11, but maybe
> there's an alternative involving #ifdef WIN32.  Are there any concerns with
> back-patching c290e79?

Sounds fine to me.

            regards, tom lane



Re: stopgap fix for signal handling during restore_command

From
Nathan Bossart
Date:
On Tue, Oct 17, 2023 at 12:47:29PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> ... and it looks like some of the back-branches are failing for Windows.
>> I'm assuming this is because c290e79 was only back-patched to v15.  My
>> first instinct is just to back-patch that one all the way to v11, but maybe
>> there's an alternative involving #ifdef WIN32.  Are there any concerns with
>> back-patching c290e79?
> 
> Sounds fine to me.

Thanks, done.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com