Thread: Using WaitEventSet in the postmaster

Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
Hi,

Here's a work-in-progress patch that uses WaitEventSet for the main
event loop in the postmaster, with a latch as the wakeup mechanism for
"PM signals" (requests from backends to do things like start a
background worker, etc).  There are still raw signals that are part of
the external interface (SIGHUP etc), but those handlers just set a
flag and set the latch, instead of doing the state machine work.  Some
advantages I can think of:

1.  Inherits various slightly more efficient modern kernel APIs for
multiplexing.
2.  Will automatically benefit from later improvements to WaitEventSet.
3.  Looks much more like the rest of our code.
4.  Requires no obscure signal programming knowledge to understand.
5.  Removes the strange call stacks we have, where most of postgres is
forked from inside a signal handler.
6.  Might help with weirdness and bugs in some signal implementations
(Cygwin, NetBSD?).
7.  Removes the need to stat() PROMOTE_SIGNAL_FILE and
LOGROTATE_SIGNAL_FILE whenever PM signals are sent, now that SIGUSR1
is less overloaded.
8.  It's a small step towards removing the need to emulate signals on Windows.

In order to avoid adding a new dependency on the contents of shared
memory, I introduced SetLatchRobustly() that will always use the slow
path kernel wakeup primitive, even in cases where SetLatch() would
not.  The idea here is that if one backend trashes shared memory,
others backends can still wake the postmaster even though it may
appear that the postmaster isn't waiting or the latch is already set.
It would be possible to go further and have a robust wait mode that
doesn't read is_set too.  It was indecision here that stopped me
proposing this sooner...

One thing that might need more work is cleanup of the PM's WES in
child processes.  Also I noticed in passing that the Windows kernel
event handles for latches are probably leaked on crash-reinit, but
that was already true, this just adds one more.  Also the way I re-add
the latch every time through the event loop in case there was a
crash-reinit is stupid, I'll tidy that up.

This is something I extracted and rejuvenated from a larger set of
patches I was hacking on a year or two ago to try to get rid of lots
of users of raw signals.  The recent thread about mamba's struggles
and the possibility that latches might help reminded me to dust this
part off, and potentially avoid some duplicated effort.

I'm not saying this is free of bugs, but it's passing on CI and seems
like enough to share and see what people think.

(Some other ideas I thought about back then: we could invent
WL_SIGNAL, and not need all those global flag variables or the
handlers that set the latch.  For eg kqueue it's trivial, and for
ancient Unixes you could do a sort of home-made signalfd with a single
generic signal handler that just does write(self_pipe, &siginfo,
sizeof(siginfo)).  But that starts to seems like refactoring for
refactoring's sake; that's probably how I'd write a native kqueue
program, but it's not even that obvious to me that we really should be
pretending that Windows has signals, which put me off that idea.
Perhaps in a post-fake-signals world we just have the postmaster's
event loop directly consume commands from pg_ctl from a control pipe?
Too many decisions at once, I gave that line of thinking up for now.)

Attachment

Re: Using WaitEventSet in the postmaster

From
Andres Freund
Date:
Hi,

On 2022-12-02 10:12:25 +1300, Thomas Munro wrote:
> Here's a work-in-progress patch that uses WaitEventSet for the main
> event loop in the postmaster

Wee!


> with a latch as the wakeup mechanism for "PM signals" (requests from
> backends to do things like start a background worker, etc).

Hm - is that directly related? ISTM that using a WES in the main loop, and
changing pmsignal.c to a latch are somewhat separate things?

Using a latch for pmsignal.c seems like a larger lift, because it means that
all of latch.c needs to be robust against a corrupted struct Latch.


> In order to avoid adding a new dependency on the contents of shared
> memory, I introduced SetLatchRobustly() that will always use the slow
> path kernel wakeup primitive, even in cases where SetLatch() would
> not.  The idea here is that if one backend trashes shared memory,
> others backends can still wake the postmaster even though it may
> appear that the postmaster isn't waiting or the latch is already set.

Why is that a concern that needs to be addressed?


ISTM that the important thing is that either a) the postmaster's latch can't
be corrupted, because it's not shared with backends or b) struct Latch can be
overwritten with random contents without causing additional problems in
postmaster.

I don't think b) is the case as the patch stands. Imagine some process
overwriting pm_latch->owner_pid. That'd then break the SetLatch() in
postmaster's signal handler, because it wouldn't realize that itself needs to
be woken up, and we'd just signal some random process.


It doesn't seem trivial (but not impossible either) to make SetLatch() robust
against arbitrary corruption. So it seems easier to me to just put the latch
in process local memory, and do a SetLatch() in postmaster's SIGUSR1 handler.

Greetings,

Andres Freund



Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Fri, Dec 2, 2022 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-12-02 10:12:25 +1300, Thomas Munro wrote:
> > with a latch as the wakeup mechanism for "PM signals" (requests from
> > backends to do things like start a background worker, etc).
>
> Hm - is that directly related? ISTM that using a WES in the main loop, and
> changing pmsignal.c to a latch are somewhat separate things?

Yeah, that's a good question.  This comes from a larger patch set
where my *goal* was to use latches everywhere possible for
interprocess wakeups, but it does indeed make a lot of sense to do the
postmaster WaitEventSet retrofit completely independently of that, and
leaving the associated robustness problems for later proposals (the
posted patch clearly fails to solve them).

> I don't think b) is the case as the patch stands. Imagine some process
> overwriting pm_latch->owner_pid. That'd then break the SetLatch() in
> postmaster's signal handler, because it wouldn't realize that itself needs to
> be woken up, and we'd just signal some random process.

Right.  At some point I had an idea about a non-shared table of
latches where OS-specific things like pids and HANDLEs live, so only
the maybe_waiting and is_set flags are in shared memory, and even
those are ignored when accessing the latch in 'robust' mode (they're
only optimisations after all).  I didn't try it though.  First you
might have to switch to a model with a finite set of latches
identified by index, or something like that.  But I like your idea of
separating that whole problem.

> It doesn't seem trivial (but not impossible either) to make SetLatch() robust
> against arbitrary corruption. So it seems easier to me to just put the latch
> in process local memory, and do a SetLatch() in postmaster's SIGUSR1 handler.

Alright, good idea, I'll do a v2 like that.



Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Fri, Dec 2, 2022 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Dec 2, 2022 at 2:40 PM Andres Freund <andres@anarazel.de> wrote:
> > It doesn't seem trivial (but not impossible either) to make SetLatch() robust
> > against arbitrary corruption. So it seems easier to me to just put the latch
> > in process local memory, and do a SetLatch() in postmaster's SIGUSR1 handler.
>
> Alright, good idea, I'll do a v2 like that.

Here's an iteration like that.  Still WIP grade.  It passes, but there
must be something I don't understand about this computer program yet,
because if I move the "if (pending_..." section up into the block
where WL_LATCH_SET has arrived (instead of testing those variables
every time through the loop), a couple of tests leave zombie
(unreaped) processes behind, indicating that something funky happened
to the state machine that I haven't yet grokked.  Will look more next
week.

By the way, I think if we do this and then also do
s/select(/WaitLatchOrSocket(/ in auth.c's RADIUS code, then we could
then drop a chunk of newly unreachable code in
src/backend/port/win32/socket.c (though maybe I missed something; it's
quite hard to grep for "select" in a SQL database :-D).  There's also
a bunch of suspect stuff in there about UDP that is already dead
thanks to the pgstats work.

Attachment

Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Sat, Dec 3, 2022 at 10:41 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's an iteration like that.  Still WIP grade.  It passes, but there
> must be something I don't understand about this computer program yet,
> because if I move the "if (pending_..." section up into the block
> where WL_LATCH_SET has arrived (instead of testing those variables
> every time through the loop), a couple of tests leave zombie
> (unreaped) processes behind, indicating that something funky happened
> to the state machine that I haven't yet grokked.  Will look more next
> week.

Duh.  The reason for that was the pre-existing special case for
PM_WAIT_DEAD_END, which used a sleep(100ms) loop to wait for children
to exit, which I needed to change to a latch wait.  Fixed in the next
iteration, attached.

The reason for the existing sleep-based approach was that we didn't
want to accept any more connections (or spin furiously because the
listen queue was non-empty).  So in this version I invented a way to
suppress socket events temporarily with WL_SOCKET_IGNORE, and then
reactivate them after crash reinit.

Still WIP, but I hope travelling in the right direction.

Attachment

Re: Using WaitEventSet in the postmaster

From
Andres Freund
Date:
Hi,

On 2022-12-05 22:45:57 +1300, Thomas Munro wrote:
> On Sat, Dec 3, 2022 at 10:41 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's an iteration like that.  Still WIP grade.  It passes, but there
> > must be something I don't understand about this computer program yet,
> > because if I move the "if (pending_..." section up into the block
> > where WL_LATCH_SET has arrived (instead of testing those variables
> > every time through the loop), a couple of tests leave zombie
> > (unreaped) processes behind, indicating that something funky happened
> > to the state machine that I haven't yet grokked.  Will look more next
> > week.
> 
> Duh.  The reason for that was the pre-existing special case for
> PM_WAIT_DEAD_END, which used a sleep(100ms) loop to wait for children
> to exit, which I needed to change to a latch wait.  Fixed in the next
> iteration, attached.
> 
> The reason for the existing sleep-based approach was that we didn't
> want to accept any more connections (or spin furiously because the
> listen queue was non-empty).  So in this version I invented a way to
> suppress socket events temporarily with WL_SOCKET_IGNORE, and then
> reactivate them after crash reinit.

That seems like an odd special flag. Why do we need it? Is it just because we
want to have assertions ensuring that something is being queried?


>  * WL_SOCKET_ACCEPT is a new event for an incoming connection (on Unix,
>    this is just another name for WL_SOCKET_READABLE, but Window has a
>    different underlying event; this mirrors WL_SOCKET_CONNECTED on the
>    other end of a connection)

Perhaps worth committing separately and soon? Seems pretty uncontroversial
from here.


> +/*
> + * Object representing the state of a postmaster.
> + *
> + * XXX Lots of global variables could move in here.
> + */
> +typedef struct
> +{
> +    WaitEventSet    *wes;
> +} Postmaster;
> +

Seems weird to introduce this but then basically have it be unused. I'd say
either have a preceding patch move at least a few members into it, or just
omit it for now.


> +    /* This may configure SIGURG, depending on platform. */
> +    InitializeLatchSupport();
> +    InitLocalLatch();

I'm mildly preferring InitProcessLocalLatch(), but not sure why - there's not
really a conflicting meaning of "local" here.


> +/*
> + * Initialize the WaitEventSet we'll use in our main event loop.
> + */
> +static void
> +InitializeWaitSet(Postmaster *postmaster)
> +{
> +    /* Set up a WaitEventSet for our latch and listening sockets. */
> +    postmaster->wes = CreateWaitEventSet(CurrentMemoryContext, 1 + MAXLISTEN);
> +    AddWaitEventToSet(postmaster->wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
> +    for (int i = 0; i < MAXLISTEN; i++)
> +    {
> +        int            fd = ListenSocket[i];
> +
> +        if (fd == PGINVALID_SOCKET)
> +            break;
> +        AddWaitEventToSet(postmaster->wes, WL_SOCKET_ACCEPT, fd, NULL, NULL);
> +    }
> +}

The naming seems like it could be confused with latch.h
infrastructure. InitPostmasterWaitSet()?


> +/*
> + * Activate or deactivate the server socket events.
> + */
> +static void
> +AdjustServerSocketEvents(Postmaster *postmaster, bool active)
> +{
> +    for (int pos = 1; pos < GetNumRegisteredWaitEvents(postmaster->wes); ++pos)
> +        ModifyWaitEvent(postmaster->wes,
> +                        pos, active ? WL_SOCKET_ACCEPT : WL_SOCKET_IGNORE,
> +                        NULL);
> +}

This seems to hardcode the specific wait events we're waiting for based on
latch.c infrastructure. Not really convinced that's a good idea.

> +        /*
> +         * Latch set by signal handler, or new connection pending on any of our
> +         * sockets? If the latter, fork a child process to deal with it.
> +         */
> +        for (int i = 0; i < nevents; i++)
>          {
> -            if (errno != EINTR && errno != EWOULDBLOCK)
> +            if (events[i].events & WL_LATCH_SET)
>              {
> -                ereport(LOG,
> -                        (errcode_for_socket_access(),
> -                         errmsg("select() failed in postmaster: %m")));
> -                return STATUS_ERROR;
> +                ResetLatch(MyLatch);
> +
> +                /* Process work scheduled by signal handlers. */
> +                if (pending_action_request)
> +                    process_action_request(postmaster);
> +                if (pending_child_exit)
> +                    process_child_exit(postmaster);
> +                if (pending_reload_request)
> +                    process_reload_request();
> +                if (pending_shutdown_request)
> +                    process_shutdown_request(postmaster);
>              }

Is the order of operations here quite right? Shouldn't we process a shutdown
request before the others? And a child exit before the request to start an
autovac worker etc?

ISTM it should be 1) shutdown request 2) child exit 3) config reload 4) action
request.


>  /*
> - * pmdie -- signal handler for processing various postmaster signals.
> + * pg_ctl uses SIGTERM, SIGINT and SIGQUIT to request different types of
> + * shutdown.
>   */
>  static void
> -pmdie(SIGNAL_ARGS)
> +handle_shutdown_request_signal(SIGNAL_ARGS)
>  {
> -    int            save_errno = errno;
> -
> -    ereport(DEBUG2,
> -            (errmsg_internal("postmaster received signal %d",
> -                             postgres_signal_arg)));
> +    int save_errno = errno;
>  
>      switch (postgres_signal_arg)
>      {
>          case SIGTERM:
> +            pending_shutdown_request = SmartShutdown;
> +            break;
> +        case SIGINT:
> +            pending_shutdown_request = FastShutdown;
> +            break;
> +        case SIGQUIT:
> +            pending_shutdown_request = ImmediateShutdown;
> +            break;
> +    }
> +    SetLatch(MyLatch);
> +
> +    errno = save_errno;
> +}

Hm, not having the "postmaster received signal" message anymore seems like a
loss when debugging things. I think process_shutdown_request() should emit
something like it.

I wonder if we should have a elog_sighand() that's written to be signal
safe. I've written versions of that numerous times for debugging, and it's a
bit silly to do that over and over again.



> @@ -2905,23 +2926,33 @@ pmdie(SIGNAL_ARGS)
>               * Now wait for backends to exit.  If there are none,
>               * PostmasterStateMachine will take the next step.
>               */
> -            PostmasterStateMachine();
> +            PostmasterStateMachine(postmaster);
>              break;

I'm by now fairly certain that it's a bad idea to have this change mixed in
with the rest of this large-ish change.


>  static void
> -PostmasterStateMachine(void)
> +PostmasterStateMachine(Postmaster *postmaster)
>  {
>      /* If we're doing a smart shutdown, try to advance that state. */
>      if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
> @@ -3819,6 +3849,9 @@ PostmasterStateMachine(void)
>              Assert(AutoVacPID == 0);
>              /* syslogger is not considered here */
>              pmState = PM_NO_CHILDREN;
> +
> +            /* re-activate server socket events */
> +            AdjustServerSocketEvents(postmaster, true);
>          }
>      }
>  
> @@ -3905,6 +3938,9 @@ PostmasterStateMachine(void)
>          pmState = PM_STARTUP;
>          /* crash recovery started, reset SIGKILL flag */
>          AbortStartTime = 0;
> +
> +        /* start accepting server socket connection events again */
> +        reenable_server_socket_events = true;
>      }
>  }

I don't think reenable_server_socket_events does anything as the patch stands
- I don't see it being checked anywhere?  And in the path above, you're using
AdjustServerSocketEvents() directly.


> @@ -4094,6 +4130,7 @@ BackendStartup(Port *port)
>      /* Hasn't asked to be notified about any bgworkers yet */
>      bn->bgworker_notify = false;
>  
> +    PG_SETMASK(&BlockSig);
>  #ifdef EXEC_BACKEND
>      pid = backend_forkexec(port);
>  #else                            /* !EXEC_BACKEND */

There are other calls to fork_process() - why don't they need the same
treatment?

Perhaps we should add an assertion to fork_process() ensuring that signals are
masked?


> diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
> index eb3a569aae..3bfef592eb 100644
> --- a/src/backend/storage/ipc/latch.c
> +++ b/src/backend/storage/ipc/latch.c
> @@ -283,6 +283,17 @@ InitializeLatchSupport(void)
>  #ifdef WAIT_USE_SIGNALFD
>      sigset_t    signalfd_mask;
>  
> +    if (IsUnderPostmaster)
> +    {
> +        if (signal_fd != -1)
> +        {
> +            /* Release postmaster's signal FD; ignore any error */
> +            (void) close(signal_fd);
> +            signal_fd = -1;
> +            ReleaseExternalFD();
> +        }
> +    }
> +

Hm - arguably it's a bug that we don't do this right now, correct?

> @@ -1201,6 +1214,7 @@ WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event, int old_events)
>             event->events == WL_POSTMASTER_DEATH ||
>             (event->events & (WL_SOCKET_READABLE |
>                               WL_SOCKET_WRITEABLE |
> +                             WL_SOCKET_IGNORE |
>                               WL_SOCKET_CLOSED)));
>  
>      if (event->events == WL_POSTMASTER_DEATH)
> @@ -1312,6 +1326,8 @@ WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
>              flags |= FD_WRITE;
>          if (event->events & WL_SOCKET_CONNECTED)
>              flags |= FD_CONNECT;
> +        if (event->events & WL_SOCKET_ACCEPT)
> +            flags |= FD_ACCEPT;
>  
>          if (*handle == WSA_INVALID_EVENT)
>          {

I wonder if the code would end up easier to understand if we handled
WL_SOCKET_CONNECTED, WL_SOCKET_ACCEPT explicitly in the !WIN32 cases, rather
than redefining it to WL_SOCKET_READABLE.



> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
> index 3082093d1e..655e881688 100644
> --- a/src/backend/tcop/postgres.c
> +++ b/src/backend/tcop/postgres.c
> @@ -24,7 +24,6 @@
>  #include <signal.h>
>  #include <unistd.h>
>  #include <sys/resource.h>
> -#include <sys/select.h>
>  #include <sys/socket.h>
>  #include <sys/time.h>

Do you know why this include even existed?


Greetings,

Andres Freund



Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Tue, Dec 6, 2022 at 7:09 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-12-05 22:45:57 +1300, Thomas Munro wrote:
> > The reason for the existing sleep-based approach was that we didn't
> > want to accept any more connections (or spin furiously because the
> > listen queue was non-empty).  So in this version I invented a way to
> > suppress socket events temporarily with WL_SOCKET_IGNORE, and then
> > reactivate them after crash reinit.
>
> That seems like an odd special flag. Why do we need it? Is it just because we
> want to have assertions ensuring that something is being queried?

Yeah.  Perhaps 0 would be a less clumsy way to say "no events please".
I removed the assertions and did it that way in this next iteration.

I realised that the previous approach didn't actually suppress POLLHUP
and POLLERR in the poll and epoll implementations (even though our
code seems to think it needs to ask for those events, it's not
necessary, you get them anyway), and, being level-triggered, if those
were ever reported we'd finish up pegging the CPU to 100% until the
children exited.  Unlikely to happen with a server socket, but wrong
on principle, and maybe a problem for other potential users of this
temporary event suppression mode.

One way to fix that for the epoll version is to EPOLL_CTL_DEL and
EPOLL_CTL_ADD, whenever transitioning to/from a zero event mask.
Tried like that in this version.  Another approach would be to
(finally) write DeleteWaitEvent() to do the same thing at a higher
level... seems overkill for this.

The kqueue version was already doing that because of the way it was
implemented, and the poll and Windows versions needed only a small
adjustment.  I'm not too sure about the Windows change; my two ideas
are passing the 0 through as shown in this version (not sure if it
really works the way I want, but it makes some sense and the
WSAEventSelect() call doesn't fail...), or sticking a dummy unsignaled
event in the array passed to WaitForMultipleObjects().

To make sure this code is exercised, I made the state machine code
eager about silencing the socket events during PM_WAIT_DEAD_END, so
crash TAP tests go through the cycle.  Regular non-crash shutdown also
runs EPOLL_CTL_DEL/EV_DELETE, which stands out if you trace the
postmaster.

> >  * WL_SOCKET_ACCEPT is a new event for an incoming connection (on Unix,
> >    this is just another name for WL_SOCKET_READABLE, but Window has a
> >    different underlying event; this mirrors WL_SOCKET_CONNECTED on the
> >    other end of a connection)
>
> Perhaps worth committing separately and soon? Seems pretty uncontroversial
> from here.

Alright, I split this into a separate patch.

> > +/*
> > + * Object representing the state of a postmaster.
> > + *
> > + * XXX Lots of global variables could move in here.
> > + */
> > +typedef struct
> > +{
> > +     WaitEventSet    *wes;
> > +} Postmaster;
> > +
>
> Seems weird to introduce this but then basically have it be unused. I'd say
> either have a preceding patch move at least a few members into it, or just
> omit it for now.

Alright, I'll just have to make a global variable wait_set for now to
keep things simple.

> > +     /* This may configure SIGURG, depending on platform. */
> > +     InitializeLatchSupport();
> > +     InitLocalLatch();
>
> I'm mildly preferring InitProcessLocalLatch(), but not sure why - there's not
> really a conflicting meaning of "local" here.

Done.

> > +/*
> > + * Initialize the WaitEventSet we'll use in our main event loop.
> > + */
> > +static void
> > +InitializeWaitSet(Postmaster *postmaster)
> > +{
> > +     /* Set up a WaitEventSet for our latch and listening sockets. */
> > +     postmaster->wes = CreateWaitEventSet(CurrentMemoryContext, 1 + MAXLISTEN);
> > +     AddWaitEventToSet(postmaster->wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
> > +     for (int i = 0; i < MAXLISTEN; i++)
> > +     {
> > +             int                     fd = ListenSocket[i];
> > +
> > +             if (fd == PGINVALID_SOCKET)
> > +                     break;
> > +             AddWaitEventToSet(postmaster->wes, WL_SOCKET_ACCEPT, fd, NULL, NULL);
> > +     }
> > +}
>
> The naming seems like it could be confused with latch.h
> infrastructure. InitPostmasterWaitSet()?

OK.

> > +/*
> > + * Activate or deactivate the server socket events.
> > + */
> > +static void
> > +AdjustServerSocketEvents(Postmaster *postmaster, bool active)
> > +{
> > +     for (int pos = 1; pos < GetNumRegisteredWaitEvents(postmaster->wes); ++pos)
> > +             ModifyWaitEvent(postmaster->wes,
> > +                                             pos, active ? WL_SOCKET_ACCEPT : WL_SOCKET_IGNORE,
> > +                                             NULL);
> > +}
>
> This seems to hardcode the specific wait events we're waiting for based on
> latch.c infrastructure. Not really convinced that's a good idea.

What are you objecting to?  The assumption that the first socket is at
position 1?  The use of GetNumRegisteredWaitEvents()?

> > +                             /* Process work scheduled by signal handlers. */
> > +                             if (pending_action_request)
> > +                                     process_action_request(postmaster);
> > +                             if (pending_child_exit)
> > +                                     process_child_exit(postmaster);
> > +                             if (pending_reload_request)
> > +                                     process_reload_request();
> > +                             if (pending_shutdown_request)
> > +                                     process_shutdown_request(postmaster);

> Is the order of operations here quite right? Shouldn't we process a shutdown
> request before the others? And a child exit before the request to start an
> autovac worker etc?
>
> ISTM it should be 1) shutdown request 2) child exit 3) config reload 4) action
> request.

OK, reordered like that.

> > -     ereport(DEBUG2,
> > -                     (errmsg_internal("postmaster received signal %d",
> > -                                                      postgres_signal_arg)));

> Hm, not having the "postmaster received signal" message anymore seems like a
> loss when debugging things. I think process_shutdown_request() should emit
> something like it.

I added some of these.

> I wonder if we should have a elog_sighand() that's written to be signal
> safe. I've written versions of that numerous times for debugging, and it's a
> bit silly to do that over and over again.

Right, I was being dogmatic about kicking everything that doesn't have
a great big neon "async-signal-safe" sign on it out of the handlers.

> > +
> > +             /* start accepting server socket connection events again */
> > +             reenable_server_socket_events = true;
> >       }
> >  }
>
> I don't think reenable_server_socket_events does anything as the patch stands
> - I don't see it being checked anywhere?  And in the path above, you're using
> AdjustServerSocketEvents() directly.

Sorry, that was a left over unused variable from an earlier attempt,
which I only noticed after clicking send.  Removed.

> > @@ -4094,6 +4130,7 @@ BackendStartup(Port *port)
> >       /* Hasn't asked to be notified about any bgworkers yet */
> >       bn->bgworker_notify = false;
> >
> > +     PG_SETMASK(&BlockSig);
> >  #ifdef EXEC_BACKEND
> >       pid = backend_forkexec(port);
> >  #else                                                        /* !EXEC_BACKEND */
>
> There are other calls to fork_process() - why don't they need the same
> treatment?
>
> Perhaps we should add an assertion to fork_process() ensuring that signals are
> masked?

If we're going to put an assertion in there, we might as well consider
setting and restoring the mask in that wrapper.  Tried like that in
this version.

> > diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
> > index eb3a569aae..3bfef592eb 100644
> > --- a/src/backend/storage/ipc/latch.c
> > +++ b/src/backend/storage/ipc/latch.c
> > @@ -283,6 +283,17 @@ InitializeLatchSupport(void)
> >  #ifdef WAIT_USE_SIGNALFD
> >       sigset_t        signalfd_mask;
> >
> > +     if (IsUnderPostmaster)
> > +     {
> > +             if (signal_fd != -1)
> > +             {
> > +                     /* Release postmaster's signal FD; ignore any error */
> > +                     (void) close(signal_fd);
> > +                     signal_fd = -1;
> > +                     ReleaseExternalFD();
> > +             }
> > +     }
> > +
>
> Hm - arguably it's a bug that we don't do this right now, correct?

Yes, I would say it's a non-live bug.  A signalfd descriptor inherited
by a child process isn't dangerous (it doesn't see the parent's
signals, it sees the child's signals), but it's a waste because we'd
leak it.  I guess we could re-use it instead but that seems a little
weird.  I've put this into a separate commit in case someone wants to
argue for back-patching, but it's a pretty hypothetical concern since
the postmaster never initialised latch support before...

One thing that does seem a bit odd to me, though, is why we're
cleaning up inherited descriptors in a function called
InitializeLatchSupport().  I wonder if we should move it into
FreeLatchSupportAfterFork()?

We should also close the postmaster's epoll fd, so I invented
FreeWaitEventSetAfterFork().  I found that ClosePostmasterPorts() was
a good place to call that, though it doesn't really fit the name of
that function too well...

> > @@ -1201,6 +1214,7 @@ WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event, int old_events)
> >                  event->events == WL_POSTMASTER_DEATH ||
> >                  (event->events & (WL_SOCKET_READABLE |
> >                                                        WL_SOCKET_WRITEABLE |
> > +                                                      WL_SOCKET_IGNORE |
> >                                                        WL_SOCKET_CLOSED)));
> >
> >       if (event->events == WL_POSTMASTER_DEATH)
> > @@ -1312,6 +1326,8 @@ WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
> >                       flags |= FD_WRITE;
> >               if (event->events & WL_SOCKET_CONNECTED)
> >                       flags |= FD_CONNECT;
> > +             if (event->events & WL_SOCKET_ACCEPT)
> > +                     flags |= FD_ACCEPT;
> >
> >               if (*handle == WSA_INVALID_EVENT)
> >               {
>
> I wonder if the code would end up easier to understand if we handled
> WL_SOCKET_CONNECTED, WL_SOCKET_ACCEPT explicitly in the !WIN32 cases, rather
> than redefining it to WL_SOCKET_READABLE.

Yeah maybe we could try that separately.

> > diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
> > index 3082093d1e..655e881688 100644
> > --- a/src/backend/tcop/postgres.c
> > +++ b/src/backend/tcop/postgres.c
> > @@ -24,7 +24,6 @@
> >  #include <signal.h>
> >  #include <unistd.h>
> >  #include <sys/resource.h>
> > -#include <sys/select.h>
> >  #include <sys/socket.h>
> >  #include <sys/time.h>
>
> Do you know why this include even existed?

That turned out to be a fun question to answer: apparently there used
to be an optional 'multiplexed backend' mode, removed by commit
d5bbe2aca5 in 1998.  A single backend could be connected to multiple
frontends.

Attachment

Re: Using WaitEventSet in the postmaster

From
Andres Freund
Date:
On 2022-12-07 00:58:06 +1300, Thomas Munro wrote:
> One way to fix that for the epoll version is to EPOLL_CTL_DEL and
> EPOLL_CTL_ADD, whenever transitioning to/from a zero event mask.
> Tried like that in this version.  Another approach would be to
> (finally) write DeleteWaitEvent() to do the same thing at a higher
> level... seems overkill for this.

What about just recreating the WES during crash restart?


> > This seems to hardcode the specific wait events we're waiting for based on
> > latch.c infrastructure. Not really convinced that's a good idea.
>
> What are you objecting to?  The assumption that the first socket is at
> position 1?  The use of GetNumRegisteredWaitEvents()?

The latter.



Re: Using WaitEventSet in the postmaster

From
Justin Pryzby
Date:
> From 61480441f67ca7fac96ca4bcfe85f27013a47aa8 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Tue, 6 Dec 2022 16:13:36 +1300
> Subject: [PATCH v4 2/5] Don't leak a signalfd when using latches in the
>  postmaster.
> 
> +        /*
> +         * It would probably be safe to re-use the inherited signalfd since
> +         * signalfds only see the current processes pending signals, but it

I think you mean "current process's", right ?



Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Wed, Dec 7, 2022 at 12:12 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-12-07 00:58:06 +1300, Thomas Munro wrote:
> > One way to fix that for the epoll version is to EPOLL_CTL_DEL and
> > EPOLL_CTL_ADD, whenever transitioning to/from a zero event mask.
> > Tried like that in this version.  Another approach would be to
> > (finally) write DeleteWaitEvent() to do the same thing at a higher
> > level... seems overkill for this.
>
> What about just recreating the WES during crash restart?

It seems a bit like cheating but yeah that's a super simple solution,
and removes one patch from the stack.  Done like that in this version.

> > > This seems to hardcode the specific wait events we're waiting for based on
> > > latch.c infrastructure. Not really convinced that's a good idea.
> >
> > What are you objecting to?  The assumption that the first socket is at
> > position 1?  The use of GetNumRegisteredWaitEvents()?
>
> The latter.

Removed.

Attachment

Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Wed, Dec 7, 2022 at 2:08 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > +             /*
> > +              * It would probably be safe to re-use the inherited signalfd since
> > +              * signalfds only see the current processes pending signals, but it
>
> I think you mean "current process's", right ?

Fixed in v5, thanks.



Re: Using WaitEventSet in the postmaster

From
Andres Freund
Date:
Hi,

On 2022-12-07 14:12:37 +1300, Thomas Munro wrote:
> On Wed, Dec 7, 2022 at 12:12 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-12-07 00:58:06 +1300, Thomas Munro wrote:
> > > One way to fix that for the epoll version is to EPOLL_CTL_DEL and
> > > EPOLL_CTL_ADD, whenever transitioning to/from a zero event mask.
> > > Tried like that in this version.  Another approach would be to
> > > (finally) write DeleteWaitEvent() to do the same thing at a higher
> > > level... seems overkill for this.
> >
> > What about just recreating the WES during crash restart?
> 
> It seems a bit like cheating but yeah that's a super simple solution,
> and removes one patch from the stack.  Done like that in this version.

I somewhat wish we'd do that more aggressively during crash-restart, rather
than the opposite. Mostly around shared memory contents though, so perhaps
that's not that comparable...

Greetings,

Andres Freund



Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
I pushed the small preliminary patches.  Here's a rebase of the main patch.

Attachment

Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Fri, Dec 23, 2022 at 8:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I pushed the small preliminary patches.  Here's a rebase of the main patch.

Here are some questions I have considered.  Anyone got an opinion
on point 3, in particular?

1.  Is it OK that we are now using APIs that might throw, in places
where we weren't?  I think so: we don't really expect WaitEventSet
APIs to throw unless something is pretty seriously wrong, and if you
hack things to inject failures there then you get a FATAL error and
the postmaster exits and the children detect that.  I think that is
appropriate.

2.  Is it really OK to delete the pqsignal_pm() infrastructure?  I
think so.  The need for sa_mask to block all signals is gone: all
signal handlers should now be re-entrant (ie safe in case of
interruption while already in a signal handler), safe against stack
overflow (pqsignal() still blocks re-entry for the *same* signal
number, because we use sigaction() without SA_NODEFER, so a handler
can only be interrupted by a different signal, and the number of
actions installed is finite and small), and safe to run at any time
(ie safe to interrupt the user context because we just do known-good
sigatomic_t/syscall stuff and save/restore errno).  The concern about
SA_RESTART is gone, because we no longer use the underspecified
select() interface; the replacement implementation syscalls, even
poll(), return with EINTR for handlers installed with SA_RESTART, but
that's now moot anyway because we have a latch that guarantees they
return with a different event anyway.  (FTR select() is nearly extinct
in BE code, I found one other user and I plan to remove it, see RADIUS
thread, CF #4103.)

3.  Is it OK to clobber the shared pending flag for SIGQUIT, SIGTERM,
SIGINT?  If you send all of these extremely rapidly, it's
indeterminate which one will be seen by handle_shutdown_request().  I
think that's probably acceptable?  To be strict about processing only
the first one that is delivered, I think you'd need an sa_mask to
block all three signals, and then you wouldn't change
pending_shutdown_request if it's already set, which I'm willing to
code up if someone thinks that's important.  (<vapourware>Ideally I
would invent WL_SIGNAL to consume signal events serially without
handlers or global variables</vapourware>.)

4.  Is anything new leaking into child processes due to this new
infrastructure?  I don't think so; the postmaster's MemoryContext is
destroyed, and before that I'm releasing kernel resources on OSes that
need it (namely Linux, where the epoll fd and signalfd need to be
closed).

5.  Is the signal mask being correctly handled during forking?  I
think so: I decided to push the masking logic directly into the
routine that forks, to make it easy to verify that all paths set the
mask the way we want.  (While thinking about that I noticed that
signals don't seem to be initially masked on Windows; I think that's a
pre-existing condition, and I assume we get away with it because
nothing reaches the fake signal dispatch code soon enough to break
anything?  Not changing that in this patch.)

6.  Is the naming and style OK?  Better ideas welcome, but basically I
tried to avoid all unnecessary refactoring and changes, so no real
logic moves around, and the changes are pretty close to "mechanical".
One bikeshed decision was what to call the {handle,process}_XXX
functions and associated flags.  Maybe "action" isn't the best name;
but it could be a request from pg_ctl or a request from a child
process.  I went with newly invented names for these handlers rather
than "handle_SIGUSR1" etc because (1) the 3 different shutdown request
signals point to a common handler and (2) I hope to switch to latches
instead of SIGUSR1 for "action" in later work.  But I could switch to
got_SIGUSR1 style variables if someone thinks it's better.

Here's a new version, with small changes:
* remove a stray reference to select() in a pqcomm.c comment
* move PG_SETMASK(&UnBlockSig) below the bit that sets up SIGTTIN etc
* pgindent

Attachment

Re: Using WaitEventSet in the postmaster

From
Andres Freund
Date:
Hi,

On 2023-01-07 11:08:36 +1300, Thomas Munro wrote:
> 1.  Is it OK that we are now using APIs that might throw, in places
> where we weren't?  I think so: we don't really expect WaitEventSet
> APIs to throw unless something is pretty seriously wrong, and if you
> hack things to inject failures there then you get a FATAL error and
> the postmaster exits and the children detect that.  I think that is
> appropriate.

I think it's ok in principle. It might be that we'll find some thing to fix in
the future, but I don't see anything fundamental or obvious.


> 2.  Is it really OK to delete the pqsignal_pm() infrastructure?  I
> think so.

Same.


> 3.  Is it OK to clobber the shared pending flag for SIGQUIT, SIGTERM,
> SIGINT?  If you send all of these extremely rapidly, it's
> indeterminate which one will be seen by handle_shutdown_request().

That doesn't seem optimal. I'm mostly worried that we can end up downgrading a
shutdown request.


> I think that's probably acceptable?  To be strict about processing only the
> first one that is delivered, I think you'd need an sa_mask to block all
> three signals, and then you wouldn't change pending_shutdown_request if it's
> already set, which I'm willing to code up if someone thinks that's
> important.  (<vapourware>Ideally I would invent WL_SIGNAL to consume signal
> events serially without handlers or global variables</vapourware>.)

Hm. The need for blocking sa_mask solely comes from using one variable in
three signal handlers, right?  It's not pretty, but to me the easiest fix here
seems to be have separate pending_{fast,smart,immediate}_shutdown_request
variables and deal with them in process_shutdown_request().  Might still make
sense to have one pending_shutdown_request variable, to avoid unnecessary
branches before calling process_shutdown_request().


> 5.  Is the signal mask being correctly handled during forking?  I
> think so: I decided to push the masking logic directly into the
> routine that forks, to make it easy to verify that all paths set the
> mask the way we want.

Hm. If I understand correctly, you used sigprocmask() directly (vs
PG_SETMASK()) in fork_process() because you want the old mask? But why do we
restore the prior mask, instead of just using PG_SETMASK(&UnBlockSig); as we
still do in a bunch of places in the postmaster?

Not that I'm advocating for that, but would there be any real harm in just
continuing to accept signals post fork? Now all the signal handlers should
just end up pointlessly setting a local variable that's not going to be read
any further?  If true, it'd be good to add a comment explaining that this is
just a belt-and-suspenders thing.


> (While thinking about that I noticed that signals don't seem to be initially
> masked on Windows; I think that's a pre-existing condition, and I assume we
> get away with it because nothing reaches the fake signal dispatch code soon
> enough to break anything?  Not changing that in this patch.)

It's indeed a bit odd that we do pgwin32_signal_initialize() before the
initmask() and PG_SETMASK(&BlockSig) in InitPostmasterChild(). I guess it's
kinda harmless though?


I'm now somewhat weirded out by the choice to do pg_strong_random_init() in
fork_process() rather than InitPostmasterChild(). Seems odd.


> 6.  Is the naming and style OK?  Better ideas welcome, but basically I
> tried to avoid all unnecessary refactoring and changes, so no real
> logic moves around, and the changes are pretty close to "mechanical".
> One bikeshed decision was what to call the {handle,process}_XXX
> functions and associated flags.

I wonder if it'd be good to have a _pm_ in the name.


> Maybe "action" isn't the best name;

Yea, I don't like it. A shutdown is also an action, etc. What about just using
_pmsignal_? It's a it odd because there's two signals in the name, but it
still feels better than 'action' and better than the existing sigusr1_handler.


> +
> +/* I/O multiplexing object */
> +static WaitEventSet *wait_set;

I'd name it a bit more obviously connected to postmaster, particularly because
it does survive into forked processes and needs to be closed there.


> +/*
> + * Activate or deactivate notifications of server socket events.  Since we
> + * don't currently have a way to remove events from an existing WaitEventSet,
> + * we'll just destroy and recreate the whole thing.  This is called during
> + * shutdown so we can wait for backends to exit without accepting new
> + * connections, and during crash reinitialization when we need to start
> + * listening for new connections again.
> + */

I'd maybe reference that this gets cleaned up via ClosePostmasterPorts(), it's
not *immediately* obvious.


> +static void
> +ConfigurePostmasterWaitSet(bool accept_connections)
> +{
> +    if (wait_set)
> +        FreeWaitEventSet(wait_set);
> +    wait_set = NULL;
> +
> +    wait_set = CreateWaitEventSet(CurrentMemoryContext, 1 + MAXLISTEN);
> +    AddWaitEventToSet(wait_set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);

Is there any reason to use MAXLISTEN here? We know how many sockets w're
listening on by this point, I think? No idea if the overhead matters anywhere,
but ...

I guess all the other code already does so, but afaict we don't dynamically
allocate resources there for things like ListenSocket[].



> -        /* Now check the select() result */
> -        if (selres < 0)
> -        {
> -            if (errno != EINTR && errno != EWOULDBLOCK)
> -            {
> -                ereport(LOG,
> -                        (errcode_for_socket_access(),
> -                         errmsg("select() failed in postmaster: %m")));
> -                return STATUS_ERROR;
> -            }
> -        }
> +        nevents = WaitEventSetWait(wait_set,
> +                                   timeout.tv_sec * 1000 + timeout.tv_usec / 1000,
> +                                   events,
> +                                   lengthof(events),
> +                                   0 /* postmaster posts no wait_events */ );
>  
>          /*
> -         * New connection pending on any of our sockets? If so, fork a child
> -         * process to deal with it.
> +         * Latch set by signal handler, or new connection pending on any of
> +         * our sockets? If the latter, fork a child process to deal with it.
>           */
> -        if (selres > 0)
> +        for (int i = 0; i < nevents; i++)
>          {

Hm. This is preexisting behaviour, but now it seems somewhat odd that we might
end up happily forking a backend for each socket without checking signals
inbetween.  Forking might take a while, and if a signal arrived since the
WaitEventSetWait() we'll not react to it.


>  static void
>  PostmasterStateMachine(void)
> @@ -3796,6 +3819,9 @@ PostmasterStateMachine(void)
>  
>      if (pmState == PM_WAIT_DEAD_END)
>      {
> +        /* Don't allow any new socket connection events. */
> +        ConfigurePostmasterWaitSet(false);

Hm. Is anything actually using the wait set until we re-create it with (true)
below?


Greetings,

Andres Freund



Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Sat, Jan 7, 2023 at 12:25 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-07 11:08:36 +1300, Thomas Munro wrote:
> > 3.  Is it OK to clobber the shared pending flag for SIGQUIT, SIGTERM,
> > SIGINT?  If you send all of these extremely rapidly, it's
> > indeterminate which one will be seen by handle_shutdown_request().
>
> That doesn't seem optimal. I'm mostly worried that we can end up downgrading a
> shutdown request.

I was contemplating whether I needed to do some more push-ups to
prefer the first delivered signal (instead of the last), but you're
saying that it would be enough to prefer the fastest shutdown type, in
cases where more than one signal was handled between server loops.
WFM.

> It's not pretty, but to me the easiest fix here
> seems to be have separate pending_{fast,smart,immediate}_shutdown_request
> variables and deal with them in process_shutdown_request().  Might still make
> sense to have one pending_shutdown_request variable, to avoid unnecessary
> branches before calling process_shutdown_request().

OK, tried that way.

> > 5.  Is the signal mask being correctly handled during forking?  I
> > think so: I decided to push the masking logic directly into the
> > routine that forks, to make it easy to verify that all paths set the
> > mask the way we want.
>
> Hm. If I understand correctly, you used sigprocmask() directly (vs
> PG_SETMASK()) in fork_process() because you want the old mask? But why do we
> restore the prior mask, instead of just using PG_SETMASK(&UnBlockSig); as we
> still do in a bunch of places in the postmaster?

It makes zero difference in practice but I think it's a nicer way to
code it because it doesn't make an unnecessary assumption about what
the signal mask was on entry.

> Not that I'm advocating for that, but would there be any real harm in just
> continuing to accept signals post fork? Now all the signal handlers should
> just end up pointlessly setting a local variable that's not going to be read
> any further?  If true, it'd be good to add a comment explaining that this is
> just a belt-and-suspenders thing.

Seems plausible and a nice idea to research.  I think it might take
some analysis of important signals that children might miss before
they install their own handlers.  Comment added.

> > 6.  Is the naming and style OK?  Better ideas welcome, but basically I
> > tried to avoid all unnecessary refactoring and changes, so no real
> > logic moves around, and the changes are pretty close to "mechanical".
> > One bikeshed decision was what to call the {handle,process}_XXX
> > functions and associated flags.
>
> I wonder if it'd be good to have a _pm_ in the name.

I dunno about this one, it's all static stuff in a file called
postmaster.c and one (now) already has pm in it (see below).

> > Maybe "action" isn't the best name;
>
> Yea, I don't like it. A shutdown is also an action, etc. What about just using
> _pmsignal_? It's a it odd because there's two signals in the name, but it
> still feels better than 'action' and better than the existing sigusr1_handler.

Done.

> > +
> > +/* I/O multiplexing object */
> > +static WaitEventSet *wait_set;
>
> I'd name it a bit more obviously connected to postmaster, particularly because
> it does survive into forked processes and needs to be closed there.

Done, as pm_wait_set.

> > +/*
> > + * Activate or deactivate notifications of server socket events.  Since we
> > + * don't currently have a way to remove events from an existing WaitEventSet,
> > + * we'll just destroy and recreate the whole thing.  This is called during
> > + * shutdown so we can wait for backends to exit without accepting new
> > + * connections, and during crash reinitialization when we need to start
> > + * listening for new connections again.
> > + */
>
> I'd maybe reference that this gets cleaned up via ClosePostmasterPorts(), it's
> not *immediately* obvious.

Done.

> > +static void
> > +ConfigurePostmasterWaitSet(bool accept_connections)
> > +{
> > +     if (wait_set)
> > +             FreeWaitEventSet(wait_set);
> > +     wait_set = NULL;
> > +
> > +     wait_set = CreateWaitEventSet(CurrentMemoryContext, 1 + MAXLISTEN);
> > +     AddWaitEventToSet(wait_set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
>
> Is there any reason to use MAXLISTEN here? We know how many sockets w're
> listening on by this point, I think? No idea if the overhead matters anywhere,
> but ...

Fixed.

> >               /*
> > -              * New connection pending on any of our sockets? If so, fork a child
> > -              * process to deal with it.
> > +              * Latch set by signal handler, or new connection pending on any of
> > +              * our sockets? If the latter, fork a child process to deal with it.
> >                */
> > -             if (selres > 0)
> > +             for (int i = 0; i < nevents; i++)
> >               {
>
> Hm. This is preexisting behaviour, but now it seems somewhat odd that we might
> end up happily forking a backend for each socket without checking signals
> inbetween.  Forking might take a while, and if a signal arrived since the
> WaitEventSetWait() we'll not react to it.

Right, so if you have 64 server sockets and they all have clients
waiting on their listen queue, we'll service one connection from each
before getting back to checking for pmsignals or shutdown, and that's
unchanged in this patch.  (FWIW I also noticed that problem while
experimenting with the idea that you could handle multiple clients in
one go on OSes that report the listen queue depth size along with
WL_SOCKET_ACCEPT events, but didn't pursue it...)

I guess we could check every time through the nevents loop.  I may
look into that in a later patch, but I prefer to keep the same policy
in this patch.

> >  static void
> >  PostmasterStateMachine(void)
> > @@ -3796,6 +3819,9 @@ PostmasterStateMachine(void)
> >
> >       if (pmState == PM_WAIT_DEAD_END)
> >       {
> > +             /* Don't allow any new socket connection events. */
> > +             ConfigurePostmasterWaitSet(false);
>
> Hm. Is anything actually using the wait set until we re-create it with (true)
> below?

Yes.  While in PM_WAIT_DEAD_END state, waiting for children to exit,
there may be clients trying to connect.  On master, we have a special
pg_usleep(100000L) instead of select() just for that state so we can
ignore incoming connections while waiting for the SIGCHLD reaper to
advance our state, but in this new world that's not enough.  We need
to wait for the latch to be set by handle_child_exit_signal().  So I
used the regular WES to wait for the latch (that is, no more special
case for that state), but I need to ignore socket events.  If I
didn't, then an incoming connection sitting in the listen queue would
cause the server loop to burn 100% CPU reporting a level-triggered
WL_SOCKET_ACCEPT for sockets that we don't want to accept (yet).

Attachment

Re: Using WaitEventSet in the postmaster

From
Andres Freund
Date:
Hi,

On 2023-01-07 18:08:11 +1300, Thomas Munro wrote:
> On Sat, Jan 7, 2023 at 12:25 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-01-07 11:08:36 +1300, Thomas Munro wrote:
> > > 3.  Is it OK to clobber the shared pending flag for SIGQUIT, SIGTERM,
> > > SIGINT?  If you send all of these extremely rapidly, it's
> > > indeterminate which one will be seen by handle_shutdown_request().
> >
> > That doesn't seem optimal. I'm mostly worried that we can end up downgrading a
> > shutdown request.
> 
> I was contemplating whether I needed to do some more push-ups to
> prefer the first delivered signal (instead of the last), but you're
> saying that it would be enough to prefer the fastest shutdown type, in
> cases where more than one signal was handled between server loops.
> WFM.

I don't see any need for such an order requirement - in case of receiving a
"less severe" shutdown request first, we'd process the more severe one soon
after. There's nothing to be gained by trying to follow the order of the
incoming signals.

Afaict that's also the behaviour today. pmdie() has blocks like this:
        case SIGTERM:

            /*
             * Smart Shutdown:
             *
             * Wait for children to end their work, then shut down.
             */
            if (Shutdown >= SmartShutdown)
                break;



I briefly wondered about deduplicating that code, now that we we know the
requested mode ahead of the switch. So it could be a something like:

/* don't interrupt an already in-progress shutdown in a more "severe" mode */
if (mode < Shutdown)
   return;

but that's again probaly something for later.


> > > 5.  Is the signal mask being correctly handled during forking?  I
> > > think so: I decided to push the masking logic directly into the
> > > routine that forks, to make it easy to verify that all paths set the
> > > mask the way we want.
> >
> > Hm. If I understand correctly, you used sigprocmask() directly (vs
> > PG_SETMASK()) in fork_process() because you want the old mask? But why do we
> > restore the prior mask, instead of just using PG_SETMASK(&UnBlockSig); as we
> > still do in a bunch of places in the postmaster?
> 
> It makes zero difference in practice but I think it's a nicer way to
> code it because it doesn't make an unnecessary assumption about what
> the signal mask was on entry.

Heh, to me doing something slightly different than in other places actually
seems to make it a bit less nice :). There's also some benefit in the
certainty of knowing what the mask will be.  But it doesn't matter mcuh.


> > > 6.  Is the naming and style OK?  Better ideas welcome, but basically I
> > > tried to avoid all unnecessary refactoring and changes, so no real
> > > logic moves around, and the changes are pretty close to "mechanical".
> > > One bikeshed decision was what to call the {handle,process}_XXX
> > > functions and associated flags.
> >
> > I wonder if it'd be good to have a _pm_ in the name.
> 
> I dunno about this one, it's all static stuff in a file called
> postmaster.c and one (now) already has pm in it (see below).

I guess stuff like signal handlers and their state somehow seems more global
to me than their C linkage type suggests. Hence the desire to be a bit more
"namespaced" in their naming. I do find it somewhat annoying when reasonably
important global variables aren't uniquely named when using a debugger...

But again, this isn't anything that should hold up the patch.


> > Is there any reason to use MAXLISTEN here? We know how many sockets w're
> > listening on by this point, I think? No idea if the overhead matters anywhere,
> > but ...
> 
> Fixed.

I was thinking of determining the number once, in PostmasterMain(). But that's
perhaps better done as a separate change... WFM.


> > Hm. This is preexisting behaviour, but now it seems somewhat odd that we might
> > end up happily forking a backend for each socket without checking signals
> > inbetween.  Forking might take a while, and if a signal arrived since the
> > WaitEventSetWait() we'll not react to it.
> 
> Right, so if you have 64 server sockets and they all have clients
> waiting on their listen queue, we'll service one connection from each
> before getting back to checking for pmsignals or shutdown, and that's
> unchanged in this patch.  (FWIW I also noticed that problem while
> experimenting with the idea that you could handle multiple clients in
> one go on OSes that report the listen queue depth size along with
> WL_SOCKET_ACCEPT events, but didn't pursue it...)
> 
> I guess we could check every time through the nevents loop.  I may
> look into that in a later patch, but I prefer to keep the same policy
> in this patch.

Makes sense. This was mainly me trying to make sure we're not changing subtle
stuff accidentally (and trying to understand how things work currently, as a
prerequisite).


> > >  static void
> > >  PostmasterStateMachine(void)
> > > @@ -3796,6 +3819,9 @@ PostmasterStateMachine(void)
> > >
> > >       if (pmState == PM_WAIT_DEAD_END)
> > >       {
> > > +             /* Don't allow any new socket connection events. */
> > > +             ConfigurePostmasterWaitSet(false);
> >
> > Hm. Is anything actually using the wait set until we re-create it with (true)
> > below?
> 
> Yes.  While in PM_WAIT_DEAD_END state, waiting for children to exit,
> there may be clients trying to connect.

Oh, Right. I had misread the diff, thinking the
ConfigurePostmasterWaitSet(false) was in the same PM_NO_CHILDREN branch that
the ConfigurePostmasterWaitSet(true) was in.


> On master, we have a special pg_usleep(100000L) instead of select() just for
> that state so we can ignore incoming connections while waiting for the
> SIGCHLD reaper to advance our state, but in this new world that's not
> enough.  We need to wait for the latch to be set by
> handle_child_exit_signal().  So I used the regular WES to wait for the latch
> (that is, no more special case for that state), but I need to ignore socket
> events.  If I didn't, then an incoming connection sitting in the listen
> queue would cause the server loop to burn 100% CPU reporting a
> level-triggered WL_SOCKET_ACCEPT for sockets that we don't want to accept
> (yet).

Yea, this is clearly the better approach.


A few more code review comments:

DetermineSleepTime() still deals with struct timeval, which we maintain at
some effort. Just to then convert it away from struct timeval in the
WaitEventSetWait() call. That doesn't seem quite right, and is basically
introduced in this patch.


I think ServerLoop still has an outdated comment:

 *
 * NB: Needs to be called with signals blocked

which we aren't doing (nor need to be doing) anymore.


>          /*
> -         * New connection pending on any of our sockets? If so, fork a child
> -         * process to deal with it.
> +         * Latch set by signal handler, or new connection pending on any of
> +         * our sockets? If the latter, fork a child process to deal with it.
>           */
> -        if (selres > 0)
> +        for (int i = 0; i < nevents; i++)
>          {
> -            int            i;
> -
> -            for (i = 0; i < MAXLISTEN; i++)
> +            if (events[i].events & WL_LATCH_SET)
>              {
> -                if (ListenSocket[i] == PGINVALID_SOCKET)
> -                    break;
> -                if (FD_ISSET(ListenSocket[i], &rmask))
> +                ResetLatch(MyLatch);
> +
> +                /* Process work scheduled by signal handlers. */

Very minor: It feels a tad off to say that the work was scheduled by signal
handlers, it's either from other processes or by the OS. But ...


> +/*
> + * Child processes use SIGUSR1 to send 'pmsignals'.  pg_ctl uses SIGUSR1 to ask
> + * postmaster to check for logrotate and promote files.
> + */

s/send/notify us of/, since the concrete "pmsignal" is actually transported
outside of the "OS signal" level?


LGTM.


I think this is a significant improvement, thanks for working on it.


Greetings,

Andres Freund



Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Sun, Jan 8, 2023 at 11:55 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-07 18:08:11 +1300, Thomas Munro wrote:
> > On Sat, Jan 7, 2023 at 12:25 PM Andres Freund <andres@anarazel.de> wrote:
> > > On 2023-01-07 11:08:36 +1300, Thomas Munro wrote:
> > > > 3.  Is it OK to clobber the shared pending flag for SIGQUIT, SIGTERM,
> > > > SIGINT?  If you send all of these extremely rapidly, it's
> > > > indeterminate which one will be seen by handle_shutdown_request().
> > >
> > > That doesn't seem optimal. I'm mostly worried that we can end up downgrading a
> > > shutdown request.
> >
> > I was contemplating whether I needed to do some more push-ups to
> > prefer the first delivered signal (instead of the last), but you're
> > saying that it would be enough to prefer the fastest shutdown type, in
> > cases where more than one signal was handled between server loops.
> > WFM.
>
> I don't see any need for such an order requirement - in case of receiving a
> "less severe" shutdown request first, we'd process the more severe one soon
> after. There's nothing to be gained by trying to follow the order of the
> incoming signals.

Oh, I fully agree.  I was working through the realisation that I might
need to serialise the handlers to implement the priority logic
correctly (upgrades good, downgrades bad), but your suggestion
fast-forwards to the right answer and doesn't require blocking, so I
prefer it, and had already gone that way in v9.  In this version I've
added a comment to explain that the outcome is the same in the end,
and also fixed the flag clearing logic which was subtly wrong before.

> > > I wonder if it'd be good to have a _pm_ in the name.
> >
> > I dunno about this one, it's all static stuff in a file called
> > postmaster.c and one (now) already has pm in it (see below).
>
> I guess stuff like signal handlers and their state somehow seems more global
> to me than their C linkage type suggests. Hence the desire to be a bit more
> "namespaced" in their naming. I do find it somewhat annoying when reasonably
> important global variables aren't uniquely named when using a debugger...

Alright, renamed.

> A few more code review comments:
>
> DetermineSleepTime() still deals with struct timeval, which we maintain at
> some effort. Just to then convert it away from struct timeval in the
> WaitEventSetWait() call. That doesn't seem quite right, and is basically
> introduced in this patch.

I agree, but I was trying to minimise the patch: signals and events
stuff is a lot already.  I didn't want to touch DetermineSleepTime()'s
time logic in the same commit.  But here's a separate patch for that.

> I think ServerLoop still has an outdated comment:
>
>  *
>  * NB: Needs to be called with signals blocked

Fixed.

> > +                             /* Process work scheduled by signal handlers. */
>
> Very minor: It feels a tad off to say that the work was scheduled by signal
> handlers, it's either from other processes or by the OS. But ...

OK, now it's "requested via signal handlers".

> > +/*
> > + * Child processes use SIGUSR1 to send 'pmsignals'.  pg_ctl uses SIGUSR1 to ask
> > + * postmaster to check for logrotate and promote files.
> > + */
>
> s/send/notify us of/, since the concrete "pmsignal" is actually transported
> outside of the "OS signal" level?

Fixed.

> LGTM.

Thanks.  Here's v10.  I'll wait a bit longer to see if anyone else has feedback.

Attachment

Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Wed, Jan 11, 2023 at 4:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Thanks.  Here's v10.  I'll wait a bit longer to see if anyone else has feedback.

Pushed, after a few very minor adjustments, mostly comments.  Thanks
for the reviews and pointers.  I think there are quite a lot of
refactoring and refinement opportunities unlocked by this change (I
have some draft proposals already), but for now I'll keep an eye on
the build farm.



Re: Using WaitEventSet in the postmaster

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Pushed, after a few very minor adjustments, mostly comments.  Thanks
> for the reviews and pointers.  I think there are quite a lot of
> refactoring and refinement opportunities unlocked by this change (I
> have some draft proposals already), but for now I'll keep an eye on
> the build farm.

skink seems to have found a problem:

==2011873== VALGRINDERROR-BEGIN
==2011873== Syscall param epoll_wait(events) points to unaddressable byte(s)
==2011873==    at 0x4D8DC73: epoll_wait (epoll_wait.c:30)
==2011873==    by 0x55CA49: WaitEventSetWaitBlock (latch.c:1527)
==2011873==    by 0x55D591: WaitEventSetWait (latch.c:1473)
==2011873==    by 0x4F2B28: ServerLoop (postmaster.c:1729)
==2011873==    by 0x4F3E85: PostmasterMain (postmaster.c:1452)
==2011873==    by 0x42643C: main (main.c:200)
==2011873==  Address 0x7b1e620 is 1,360 bytes inside a recently re-allocated block of size 8,192 alloc'd
==2011873==    at 0x48407B4: malloc (vg_replace_malloc.c:381)
==2011873==    by 0x6D9D30: AllocSetContextCreateInternal (aset.c:446)
==2011873==    by 0x4F2D9B: PostmasterMain (postmaster.c:614)
==2011873==    by 0x42643C: main (main.c:200)
==2011873==
==2011873== VALGRINDERROR-END

            regards, tom lane



Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Thu, Jan 12, 2023 at 7:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> skink seems to have found a problem:
>
> ==2011873== VALGRINDERROR-BEGIN
> ==2011873== Syscall param epoll_wait(events) points to unaddressable byte(s)
> ==2011873==    at 0x4D8DC73: epoll_wait (epoll_wait.c:30)
> ==2011873==    by 0x55CA49: WaitEventSetWaitBlock (latch.c:1527)
> ==2011873==    by 0x55D591: WaitEventSetWait (latch.c:1473)
> ==2011873==    by 0x4F2B28: ServerLoop (postmaster.c:1729)
> ==2011873==    by 0x4F3E85: PostmasterMain (postmaster.c:1452)
> ==2011873==    by 0x42643C: main (main.c:200)
> ==2011873==  Address 0x7b1e620 is 1,360 bytes inside a recently re-allocated block of size 8,192 alloc'd
> ==2011873==    at 0x48407B4: malloc (vg_replace_malloc.c:381)
> ==2011873==    by 0x6D9D30: AllocSetContextCreateInternal (aset.c:446)
> ==2011873==    by 0x4F2D9B: PostmasterMain (postmaster.c:614)
> ==2011873==    by 0x42643C: main (main.c:200)
> ==2011873==
> ==2011873== VALGRINDERROR-END

Repro'd here on Valgrind.  Oh, that's interesting.  WaitEventSetWait()
wants to use an internal buffer of the size given to the constructor
function, but passes the size of the caller's output buffer to
epoll_wait() and friends.  Perhaps it should use Min(nevents,
set->nevents_space).  I mean, I should have noticed that, but I think
that's arguably a pre-existing bug in the WES code, or at least an
unhelpful interface.  Thinking...



Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Thu, Jan 12, 2023 at 7:57 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Jan 12, 2023 at 7:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > skink seems to have found a problem:
> >
> > ==2011873== VALGRINDERROR-BEGIN
> > ==2011873== Syscall param epoll_wait(events) points to unaddressable byte(s)
> > ==2011873==    at 0x4D8DC73: epoll_wait (epoll_wait.c:30)
> > ==2011873==    by 0x55CA49: WaitEventSetWaitBlock (latch.c:1527)
> > ==2011873==    by 0x55D591: WaitEventSetWait (latch.c:1473)
> > ==2011873==    by 0x4F2B28: ServerLoop (postmaster.c:1729)
> > ==2011873==    by 0x4F3E85: PostmasterMain (postmaster.c:1452)
> > ==2011873==    by 0x42643C: main (main.c:200)
> > ==2011873==  Address 0x7b1e620 is 1,360 bytes inside a recently re-allocated block of size 8,192 alloc'd
> > ==2011873==    at 0x48407B4: malloc (vg_replace_malloc.c:381)
> > ==2011873==    by 0x6D9D30: AllocSetContextCreateInternal (aset.c:446)
> > ==2011873==    by 0x4F2D9B: PostmasterMain (postmaster.c:614)
> > ==2011873==    by 0x42643C: main (main.c:200)
> > ==2011873==
> > ==2011873== VALGRINDERROR-END
>
> Repro'd here on Valgrind.  Oh, that's interesting.  WaitEventSetWait()
> wants to use an internal buffer of the size given to the constructor
> function, but passes the size of the caller's output buffer to
> epoll_wait() and friends.  Perhaps it should use Min(nevents,
> set->nevents_space).  I mean, I should have noticed that, but I think
> that's arguably a pre-existing bug in the WES code, or at least an
> unhelpful interface.  Thinking...

Yeah.  This stops valgrind complaining here.

Attachment

Re: Using WaitEventSet in the postmaster

From
Andres Freund
Date:
Hi,

On 2023-01-12 20:35:43 +1300, Thomas Munro wrote:
> Subject: [PATCH] Fix WaitEventSetWait() buffer overrun.
> 
> The WAIT_USE_EPOLL and WAIT_USE_KQUEUE implementations of
> WaitEventSetWaitBlock() confused the size of their internal buffer with
> the size of the caller's output buffer, and could ask the kernel for too
> many events.  In fact the set of events retrieved from the kernel needs
> to be able to fit in both buffers, so take the minimum of the two.
> 
> The WAIT_USE_POLL and WAIT_USE WIN32 implementations didn't have this
> confusion.

> This probably didn't come up before because we always used the same
> number in both places, but commit 7389aad6 calculates a dynamic size at
> construction time, while using MAXLISTEN for its output event buffer on
> the stack.  That seems like a reasonable thing to want to do, so
> consider this to be a pre-existing bug worth fixing.

> As reported by skink, valgrind and Tom Lane.
> 
> Discussion: https://postgr.es/m/901504.1673504836%40sss.pgh.pa.us

Makes sense. We should backpatch this, I think?

Greetings,

Andres Freund



Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
On Fri, Jan 13, 2023 at 7:26 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-12 20:35:43 +1300, Thomas Munro wrote:
> > Subject: [PATCH] Fix WaitEventSetWait() buffer overrun.

> Makes sense. We should backpatch this, I think?

Done.



Re: Using WaitEventSet in the postmaster

From
Thomas Munro
Date:
The nearby thread about searching for uses of volatile reminded me: we
can now drop a bunch of these in postmaster.c.  The patch I originally
wrote to do that as part of this series somehow morphed into an
experimental patch to nuke all global variables[1], but of course we
should at least drop the now redundant use of volatile and
sigatomic_t.  See attached.

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGKH_RPAo%3DNgPfHKj--565aL1qiVpUGdWt1_pmJehY%2Bdmw%40mail.gmail.com

Attachment

Re: Using WaitEventSet in the postmaster

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> The nearby thread about searching for uses of volatile reminded me: we
> can now drop a bunch of these in postmaster.c.  The patch I originally
> wrote to do that as part of this series somehow morphed into an
> experimental patch to nuke all global variables[1], but of course we
> should at least drop the now redundant use of volatile and
> sigatomic_t.  See attached.

+1

            regards, tom lane



Re: Using WaitEventSet in the postmaster

From
Andres Freund
Date:
Hi,

On 2023-01-28 14:25:38 +1300, Thomas Munro wrote:
> The nearby thread about searching for uses of volatile reminded me: we
> can now drop a bunch of these in postmaster.c.  The patch I originally
> wrote to do that as part of this series somehow morphed into an
> experimental patch to nuke all global variables[1],

Hah.


> but of course we should at least drop the now redundant use of volatile and
> sigatomic_t.  See attached.

+1

Greetings,

Andres Freund