Thread: Refactoring postmaster's code to cleanup after child exit

Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
Reading through postmaster code, I spotted some refactoring 
opportunities to make it slightly more readable.

Currently, when a child process exits, the postmaster first scans 
through BackgroundWorkerList to see if it was a bgworker process. If not 
found, it scans through the BackendList to see if it was a backend 
process (which it really should be then). That feels a bit silly, 
because every running background worker process also has an entry in 
BackendList. There's a lot of duplication between 
CleanupBackgroundWorker and CleanupBackend.

Before commit 8a02b3d732, we used to created Backend entries only for 
background worker processes that connected to a database, not for other 
background worker processes. I think that's why we have the code 
structure we have. But now that we have a Backend entry for all bgworker 
processes, it's more natural to have single function to deal with both 
regular backends and bgworkers.

So I came up with the attached patches. This doesn't make any meaningful 
user-visible changes, except for some incidental changes in log messages 
(see commit message for details).

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Attachment

Re: Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
On 06/07/2024 22:01, Heikki Linnakangas wrote:
> Reading through postmaster code, I spotted some refactoring 
> opportunities to make it slightly more readable.
> 
> Currently, when a child process exits, the postmaster first scans 
> through BackgroundWorkerList to see if it was a bgworker process. If not 
> found, it scans through the BackendList to see if it was a backend 
> process (which it really should be then). That feels a bit silly, 
> because every running background worker process also has an entry in 
> BackendList. There's a lot of duplication between 
> CleanupBackgroundWorker and CleanupBackend.
> 
> Before commit 8a02b3d732, we used to created Backend entries only for 
> background worker processes that connected to a database, not for other 
> background worker processes. I think that's why we have the code 
> structure we have. But now that we have a Backend entry for all bgworker 
> processes, it's more natural to have single function to deal with both 
> regular backends and bgworkers.
> 
> So I came up with the attached patches. This doesn't make any meaningful 
> user-visible changes, except for some incidental changes in log messages 
> (see commit message for details).

New patch version attached. Fixed conflicts with recent commits, no 
other changes.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
I committed the first two trivial patches, and have continued to work on 
postmaster.c, and how it manages all the child processes.

This is a lot of patches. They're built on top of each other, because 
that's the order I developed them in, but they probably could be applied 
in different order too. Please help me by reviewing these, before the 
stack grows even larger :-). Even partial reviews would be very helpful. 
I suggest to start reading them in order, and when you get tired, just 
send any comments you have up to that point.


* v3-0001-Make-BackgroundWorkerList-doubly-linked.patch

This is the same refactoring patch I started this thread with.

* v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch
* v3-0004-Consolidate-postmaster-code-to-launch-background-.patch

Little refactoring of how postmaster launches the background processes.

* v3-0005-Add-test-for-connection-limits.patch
* v3-0006-Add-test-for-dead-end-backends.patch

A few new TAP tests for dead-end backends and enforcing connection 
limits. We didn't have much coverage for these before.

* v3-0007-Use-an-shmem_exit-callback-to-remove-backend-from.patch
* v3-0008-Introduce-a-separate-BackendType-for-dead-end-chi.patch

Some preliminary refactoring towards patch 
v3-0010-Assign-a-child-slot-to-every-postmaster-child-pro.patch

* v3-0009-Kill-dead-end-children-when-there-s-nothing-else-.patch

I noticed that we never send SIGTERM or SIGQUIT to dead-end backends, 
which seems silly. If the server is shutting down, dead-end backends 
might prevent the shutdown from completing. Dead-end backends will 
expire after authentication_timoeut (default 60s), so it won't last for 
too long, but still seems like we should kill dead-end backends if 
they're the only children preventing shutdown from completing.

* 3-0010-Assign-a-child-slot-to-every-postmaster-child-pro.patch

This is what I consider the main patch in this series. Currently, only 
regular backens, bgworkers and autovacuum workers have a PMChildFlags 
slot, which is used to detect when a postmaster child exits in an 
unclean way (in addition to the exit code). This patch assigns a child 
slot for all processes, except for dead-end backends. That includes all 
the aux processes.

While we're at it, I created separate pools of child slots for different 
kinds of backends, which fixes the issue that opening a lot of client 
connections can exhaust all the slots, so that background workers or 
autovacuum workers cannot start either [1].

[1] 
https://www.postgresql.org/message-id/55d2f50c-0b81-4b33-b202-cd2a406d69a3%40iki.fi

* v3-0011-Pass-MyPMChildSlot-as-an-explicit-argument-to-chi.patch

One more little refactoring, to pass MyPMChildSlot to the child process 
differently.


Where is all this leading? I'm not sure exactly, but having a postmaster 
child slot for every postmaster child seems highly useful. We could move 
the ProcSignal machinery to use those slot numbers for the indexes to 
the ProcSignal array, instead of ProcSignal, for example. That would 
allow all processes to participate in the signalling, even before they 
have a PGPROC entry. (Or with Thomas's interrupts refactoring, the 
interrupts array). With the multithreading work, PMChild struct could 
store a thread id, or whatever is needed for threads to communicate with 
each other. In any case, seems like it will come handy.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Refactoring postmaster's code to cleanup after child exit

From
Thomas Munro
Date:
On Fri, Aug 2, 2024 at 11:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> * v3-0001-Make-BackgroundWorkerList-doubly-linked.patch

LGTM.

> [v3-0002-Refactor-code-to-handle-death-of-a-backend-or-bgw.patch]

    Currently, when a child process exits, the postmaster first scans
    through BackgroundWorkerList, to see if it the child process was a
    background worker. If not found, then it scans through BackendList to
    see if it was a regular backend. That leads to some duplication
    between the bgworker and regular backend cleanup code, as both have an
    entry in the BackendList that needs to be cleaned up in the same way.
    Refactor that so that we scan just the BackendList to find the child
    process, and if it was a background worker, do the additional
    bgworker-specific cleanup in addition to the normal Backend cleanup.

Makes sense.

    On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's
    now logged with that exit code, instead of 0. Also, if a bgworker
    exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is
    restarted. Previously it was treated as a normal exit.

Interesting.  So when that error was first specially handled in this thread:


https://www.postgresql.org/message-id/flat/AANLkTimCTkNKKrHCd3Ot6kAsrSS7SeDpOTcaLsEP7i%2BM%40mail.gmail.com#41f60947571b75377f04af67ba6baf40

... it went from being considered a crash, to being considered like
exit(0).  It's true that shared memory can't be corrupted by a process
that never enters main(), but it's better not to hide the true reason
for the failure (if it is still possible -- I don't find many
references to that phenomenon in recent times).  Clobbering exitstatus
with 0 doesn't seem right at all, now that we have background workers
whose restart behaviour is affected by that.

    If a child process is not found in the BackendList, the log message
    now calls it "untracked child process" rather than "server process".
    Arguably that should be a PANIC, because we do track all the child
    processes in the list, so failing to find a child process is highly
    unexpected. But if we want to change that, let's discuss and do that
    as a separate commit.

Yeah, it would be highly unexpected if waitpid() told you about some
random other process (or we screwed up the bookkeeping and didn't
recognise it).  So at least having a different message seems good.

> * v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch

+1

> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch

    Much of the code in process_pm_child_exit() to launch replacement
    processes when one exits or when progressing to next postmaster state
    was unnecessary, because the ServerLoop will launch any missing
    background processes anyway. Remove the redundant code and let
    ServerLoop handle it.

+1, makes sense.

    In ServerLoop, move the code to launch all the processes to a new
    subroutine, to group it all together.

+1, makes sense.

More soon...



Re: Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
On 08/08/2024 13:47, Thomas Munro wrote:
>      On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's
>      now logged with that exit code, instead of 0. Also, if a bgworker
>      exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is
>      restarted. Previously it was treated as a normal exit.
> 
> Interesting.  So when that error was first specially handled in this thread:
> 
>
https://www.postgresql.org/message-id/flat/AANLkTimCTkNKKrHCd3Ot6kAsrSS7SeDpOTcaLsEP7i%2BM%40mail.gmail.com#41f60947571b75377f04af67ba6baf40
> 
> ... it went from being considered a crash, to being considered like
> exit(0).  It's true that shared memory can't be corrupted by a process
> that never enters main(), but it's better not to hide the true reason
> for the failure (if it is still possible -- I don't find many
> references to that phenomenon in recent times).  Clobbering exitstatus
> with 0 doesn't seem right at all, now that we have background workers
> whose restart behaviour is affected by that.

I adjusted this ERROR_WAIT_NO_CHILDREN a little more, to avoid logging 
the death of the child twice in some cases.

>> * v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch
> 
> +1

Committed the patches up to and including this one, with tiny comment 
changes.

>> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch
> 
>      Much of the code in process_pm_child_exit() to launch replacement
>      processes when one exits or when progressing to next postmaster state
>      was unnecessary, because the ServerLoop will launch any missing
>      background processes anyway. Remove the redundant code and let
>      ServerLoop handle it.

I'm going to work a little more on the comments on this one before 
committing; I had just moved all the "If we have lost the XXX, try to 
start a new one" comments as is, but they look pretty repetitive now.

Thanks for the review!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
On 10/08/2024 00:13, Heikki Linnakangas wrote:
> On 08/08/2024 13:47, Thomas Munro wrote:
>>> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch
>>
>>      Much of the code in process_pm_child_exit() to launch replacement
>>      processes when one exits or when progressing to next postmaster 
>> state
>>      was unnecessary, because the ServerLoop will launch any missing
>>      background processes anyway. Remove the redundant code and let
>>      ServerLoop handle it.
> 
> I'm going to work a little more on the comments on this one before 
> committing; I had just moved all the "If we have lost the XXX, try to 
> start a new one" comments as is, but they look pretty repetitive now.

Pushed this now, after adjusting the comments a bit. Thanks again for 
the review!

Here are the remaining patches, rebased.

> commit a1c43d65907d20a999b203e465db1277ec842a0a
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date:   Thu Aug 1 17:24:12 2024 +0300
> 
>     Introduce a separate BackendType for dead-end children
>     
>     And replace postmaster.c's own "backend type" codes with BackendType
>     
>     XXX: While working on this, many times I accidentally did something
>     like "foo |= B_SOMETHING" instead of "foo |= 1 << B_SOMETHING", when
>     constructing arguments to SignalSomeChildren or CountChildren, and
>     things broke in very subtle ways taking a long time to debug. The old
>     constants that were already bitmasks avoided that. Maybe we need some
>     macro magic or something to make this less error-prone.

While rebasing this today, I spotted another instance of that mistake 
mentioned in the XXX comment above. I called "CountChildren(B_BACKEND)" 
instead of "CountChildren(1 << B_BACKEND)". Some ideas on how to make 
that less error-prone:

1. Add a separate typedef for the bitmasks, and macros/functions to work 
with it. Something like:

typedef struct {
    uint32        mask;
} BackendTypeMask;

static const BackendTypeMask BTMASK_ALL = { 0xffffffff };
static const BackendTypeMask BTMASK_NONE = { 0 };

static inline BackendTypeMask
BTMASK_ADD(BackendTypeMask mask, BackendType t)
{
    mask.mask |= 1 << t;
    return mask;
}

static inline BackendTypeMask
BTMASK_DEL(BackendTypeMask mask, BackendType t)
{
    mask.mask &= ~(1 << t);
    return mask;
}

Now the compiler will complain if you try to pass a BackendType for the 
mask. We could do this just for BackendType, or we could put this in 
src/include/lib/ with a more generic name, like "bitmask_u32".

2. Another idea is to redefine the BackendType values to be separate 
bits, like the current BACKEND_TYPE_* values in postmaster.c:

typedef enum BackendType
{
    B_INVALID = 0,

    /* Backends and other backend-like processes */
    B_BACKEND = 1 << 1,
    B_DEAD_END_BACKEND = 1 << 2,
    B_AUTOVAC_LAUNCHER = 1 << 3,
    B_AUTOVAC_WORKER = 1 << 4,

    ...
} BackendType;

Then you can use | and & on BackendTypes directly. It makes it less 
clear which function arguments are a BackendType and which are a 
bitmask, however.


Thoughts, other ideas?

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Refactoring postmaster's code to cleanup after child exit

From
Alexander Lakhin
Date:
Hello Heikki,

10.08.2024 00:13, Heikki Linnakangas wrote:
>
> Committed the patches up to and including this one, with tiny comment changes.

I've noticed that on the current HEAD server.log contains binary data
(an invalid process name) after a child crash. For example, while playing
with -ftapv, I've got:
SELECT to_date('2024 613566758 1', 'IYYY IW ID');
server closed the connection unexpectedly

grep -a 'was terminated' server.log
2024-08-18 07:07:06.482 UTC|||66c19d96.3482f6|LOG:  `�!x� (PID 3441407) was terminated by signal 6: Aborted

It looks like this was introduced by commit 28a520c0b (IIUC, namebuf in
CleanupBackend() may stay uninitialized in some code paths).

Best regards,
Alexander



Re: Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
On 18/08/2024 11:00, Alexander Lakhin wrote:
> 10.08.2024 00:13, Heikki Linnakangas wrote:
>>
>> Committed the patches up to and including this one, with tiny comment changes.
> 
> I've noticed that on the current HEAD server.log contains binary data
> (an invalid process name) after a child crash. For example, while playing
> with -ftapv, I've got:
> SELECT to_date('2024 613566758 1', 'IYYY IW ID');
> server closed the connection unexpectedly
> 
> grep -a 'was terminated' server.log
> 2024-08-18 07:07:06.482 UTC|||66c19d96.3482f6|LOG:  `�!x� (PID 3441407) was terminated by signal 6: Aborted
> 
> It looks like this was introduced by commit 28a520c0b (IIUC, namebuf in
> CleanupBackend() may stay uninitialized in some code paths).

Fixed, thanks!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Refactoring postmaster's code to cleanup after child exit

From
Andres Freund
Date:
Hi,

On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:
> While rebasing this today, I spotted another instance of that mistake
> mentioned in the XXX comment above. I called "CountChildren(B_BACKEND)"
> instead of "CountChildren(1 << B_BACKEND)". Some ideas on how to make that
> less error-prone:
> 
> 1. Add a separate typedef for the bitmasks, and macros/functions to work
> with it. Something like:
> 
> typedef struct {
>     uint32        mask;
> } BackendTypeMask;
> 
> static const BackendTypeMask BTMASK_ALL = { 0xffffffff };
> static const BackendTypeMask BTMASK_NONE = { 0 };
> 
> static inline BackendTypeMask
> BTMASK_ADD(BackendTypeMask mask, BackendType t)
> {
>     mask.mask |= 1 << t;
>     return mask;
> }
> 
> static inline BackendTypeMask
> BTMASK_DEL(BackendTypeMask mask, BackendType t)
> {
>     mask.mask &= ~(1 << t);
>     return mask;
> }
> 
> Now the compiler will complain if you try to pass a BackendType for the
> mask. We could do this just for BackendType, or we could put this in
> src/include/lib/ with a more generic name, like "bitmask_u32".

I don't like the second suggestion - that just ends up creating a similar
problem in the future because flag values for one thing can be passed to
something else.



> +Running the tests
> +=================
> +
> +NOTE: You must have given the --enable-tap-tests argument to configure.
> +
> +Run
> +    make check
> +or
> +    make installcheck
> +You can use "make installcheck" if you previously did "make install".
> +In that case, the code in the installation tree is tested.  With
> +"make check", a temporary installation tree is built from the current
> +sources and then tested.
> +
> +Either way, this test initializes, starts, and stops a test Postgres
> +cluster.
> +
> +See src/test/perl/README for more info about running these tests.

Is it really useful to have such instructions all over the tree?


> From 93b9e9b6e072f63af9009e0d66ab6d0d62ea8c15 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Mon, 12 Aug 2024 10:55:11 +0300
> Subject: [PATCH v4 2/8] Add test for dead-end backends
> 
> The code path for launching a dead-end backend because we're out of
> slots was not covered by any tests, so add one. (Some tests did hit
> the case of launching a dead-end backend because the server is still
> starting up, though, so the gap in our test coverage wasn't as big as
> it sounds.)
> ---
>  src/test/perl/PostgreSQL/Test/Cluster.pm      | 39 +++++++++++++++++++
>  .../postmaster/t/001_connection_limits.pl     | 17 +++++++-
>  2 files changed, 55 insertions(+), 1 deletion(-)

Why does this need to use "raw" connections?  Can't you just create a bunch of
connections with BackgroundPsql?




> From 88287a2db95e584018f1c7fa9e992feb7d179ce8 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Mon, 12 Aug 2024 10:58:35 +0300
> Subject: [PATCH v4 3/8] Use an shmem_exit callback to remove backend from
>  PMChildFlags on exit
> 
> This seems nicer than having to duplicate the logic between
> InitProcess() and ProcKill() for which child processes have a
> PMChildFlags slot.
> 
> Move the MarkPostmasterChildActive() call earlier in InitProcess(),
> out of the section protected by the spinlock.

> ---
>  src/backend/storage/ipc/pmsignal.c | 10 ++++++--
>  src/backend/storage/lmgr/proc.c    | 38 ++++++++++--------------------
>  src/include/storage/pmsignal.h     |  1 -
>  3 files changed, 21 insertions(+), 28 deletions(-)
> 
> diff --git a/src/backend/storage/ipc/pmsignal.c b/src/backend/storage/ipc/pmsignal.c
> index 27844b46a2..cb99e77476 100644
> --- a/src/backend/storage/ipc/pmsignal.c
> +++ b/src/backend/storage/ipc/pmsignal.c
> @@ -24,6 +24,7 @@
>  #include "miscadmin.h"
>  #include "postmaster/postmaster.h"
>  #include "replication/walsender.h"
> +#include "storage/ipc.h"
>  #include "storage/pmsignal.h"
>  #include "storage/shmem.h"
>  #include "utils/memutils.h"
> @@ -121,6 +122,8 @@ postmaster_death_handler(SIGNAL_ARGS)
>  
>  #endif                            /* USE_POSTMASTER_DEATH_SIGNAL */
>  
> +static void MarkPostmasterChildInactive(int code, Datum arg);
> +
>  /*
>   * PMSignalShmemSize
>   *        Compute space needed for pmsignal.c's shared memory
> @@ -328,6 +331,9 @@ MarkPostmasterChildActive(void)
>      slot--;
>      Assert(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED);
>      PMSignalState->PMChildFlags[slot] = PM_CHILD_ACTIVE;
> +
> +    /* Arrange to clean up at exit. */
> +    on_shmem_exit(MarkPostmasterChildInactive, 0);
>  }
>  
>  /*
> @@ -352,8 +358,8 @@ MarkPostmasterChildWalSender(void)
>   * MarkPostmasterChildInactive - mark a postmaster child as done using
>   * shared memory.  This is called in the child process.
>   */
> -void
> -MarkPostmasterChildInactive(void)
> +static void
> +MarkPostmasterChildInactive(int code, Datum arg)
>  {
>      int            slot = MyPMChildSlot;
>  
> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
> index ac66da8638..9536469e89 100644
> --- a/src/backend/storage/lmgr/proc.c
> +++ b/src/backend/storage/lmgr/proc.c
> @@ -308,6 +308,19 @@ InitProcess(void)
>      if (MyProc != NULL)
>          elog(ERROR, "you already exist");
>  
> +    /*
> +     * Before we start accessing the shared memory in a serious way, mark
> +     * ourselves as an active postmaster child; this is so that the postmaster
> +     * can detect it if we exit without cleaning up.  (XXX autovac launcher
> +     * currently doesn't participate in this; it probably should.)
> +     *
> +     * Slot sync worker also does not participate in it, see comments atop
> +     * 'struct bkend' in postmaster.c.
> +     */
> +    if (IsUnderPostmaster && !AmAutoVacuumLauncherProcess() &&
> +        !AmLogicalSlotSyncWorkerProcess())
> +        MarkPostmasterChildActive();

I'd not necessarily expect a call to MarkPostmasterChildActive() to register
an shmem exit hook - but I guess it's unlikely to be moved around in a
problematic way. Perhaps something like RegisterPostmasterChild() or such
would be a bit clearer?


> From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Mon, 12 Aug 2024 10:59:04 +0300
> Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children
> 
> And replace postmaster.c's own "backend type" codes with BackendType

Hm - it seems a bit odd to open-code this when we actually have a "table
driven configuration" available?  Why isn't the type a field in
child_process_kind?

That'd not solve the bitmask confusion issue, but it does seem like a better
direction to me?


> From 9c832ce33667abc5aef128a17fa9c27daaad872a Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Mon, 12 Aug 2024 10:59:27 +0300
> Subject: [PATCH v4 5/8] Kill dead-end children when there's nothing else left
> 
> Previously, the postmaster would never try to kill dead-end child
> processes, even if there were no other processes left. A dead-end
> backend will eventually exit, when authentication_timeout expires, but
> if a dead-end backend is the only thing that's preventing the server
> from shutting down, it seems better to kill it immediately. It's
> particularly important, if there was a bug in the early startup code
> that prevented a dead-end child from timing out and exiting normally.

I do wonder if we shouldn't instead get rid of dead end children. We now have
an event based loop in postmaster, it'd perform vastly better to juts handle
these connections in postmaster. And we'd get rid of these weird backend
types.   But I guess this is a worthwhile improvement on its own...


> Includes a test for that case where a dead-end backend previously kept
> the server from shutting down.

The test hardcodes timeouts, I think we've largely come to regret that when we
did. Should probably just be a multiplier based on
PostgreSQL::Test::Utils::timeout_default?



> +/*
> + * MaxLivePostmasterChildren
> + *
> + * This reports the number postmaster child processes that can be active.  It
> + * includes all children except for dead_end children.  This allows the array
> + * in shared memory (PMChildFlags) to have a fixed maximum size.
> + */
> +int
> +MaxLivePostmasterChildren(void)
> +{
> +    int            n = 0;
> +
> +    /* We know exactly how mamy worker and aux processes can be active */
> +    n += autovacuum_max_workers;
> +    n += max_worker_processes;
> +    n += NUM_AUXILIARY_PROCS;
> +
> +    /*
> +     * We allow more connections here than we can have backends because some
> +     * might still be authenticating; they might fail auth, or some existing
> +     * backend might exit before the auth cycle is completed.  The exact
> +     * MaxBackends limit is enforced when a new backend tries to join the
> +     * shared-inval backend array.
> +     */
> +    n += 2 * (MaxConnections + max_wal_senders);
> +
> +    return n;
> +}

I wonder if we could instead maintain at least some of this in
child_process_kinds? Manually listing different types of processes in
different places doesn't seem particularly sustainable.


> +/*
> + * Initialize at postmaster startup
> + */
> +void
> +InitPostmasterChildSlots(void)
> +{
> +    int            num_pmchild_slots;
> +    int            slotno;
> +    PMChild    *slots;
> +
> +    dlist_init(&freeBackendList);
> +    dlist_init(&freeAutoVacWorkerList);
> +    dlist_init(&freeBgWorkerList);
> +    dlist_init(&freeAuxList);
> +    dlist_init(&ActiveChildList);
> +
> +    num_pmchild_slots = MaxLivePostmasterChildren();
> +
> +    slots = palloc(num_pmchild_slots * sizeof(PMChild));
> +
> +    slotno = 0;
> +    for (int i = 0; i < 2 * (MaxConnections + max_wal_senders); i++)
> +    {
> +        init_slot(&slots[slotno], slotno, &freeBackendList);
> +        slotno++;
> +    }
> +    for (int i = 0; i < autovacuum_max_workers; i++)
> +    {
> +        init_slot(&slots[slotno], slotno, &freeAutoVacWorkerList);
> +        slotno++;
> +    }
> +    for (int i = 0; i < max_worker_processes; i++)
> +    {
> +        init_slot(&slots[slotno], slotno, &freeBgWorkerList);
> +        slotno++;
> +    }
> +    for (int i = 0; i < NUM_AUXILIARY_PROCS; i++)
> +    {
> +        init_slot(&slots[slotno], slotno, &freeAuxList);
> +        slotno++;
> +    }
> +    Assert(slotno == num_pmchild_slots);
> +}

Along the same vein - could we generalize this into one array of different
slot types and then loop over that to initialize / acquire the slots?


> +/* Return the appropriate free-list for the given backend type */
> +static dlist_head *
> +GetFreeList(BackendType btype)
> +{
> +    switch (btype)
> +    {
> +        case B_BACKEND:
> +        case B_BG_WORKER:
> +        case B_WAL_SENDER:
> +        case B_SLOTSYNC_WORKER:
> +            return &freeBackendList;

Maybe a daft question - but why are all of these in the same list? Sure,
they're essentially all full backends, but they're covered by different GUCs?


> +            /*
> +             * Auxiliary processes.  There can be only one of each of these
> +             * running at a time.
> +             */
> +        case B_AUTOVAC_LAUNCHER:
> +        case B_ARCHIVER:
> +        case B_BG_WRITER:
> +        case B_CHECKPOINTER:
> +        case B_STARTUP:
> +        case B_WAL_RECEIVER:
> +        case B_WAL_SUMMARIZER:
> +        case B_WAL_WRITER:
> +            return &freeAuxList;
> +
> +            /*
> +             * Logger is not connected to shared memory, and does not have a
> +             * PGPROC entry, but we still allocate a child slot for it.
> +             */

Tangential: Why do we need a freelist for these and why do we choose a random
pgproc for these instead of assigning one statically?

Background: I'd like to not provide AIO workers with "bounce buffers" (for IO
of buffers that can't be done in-place, like writes when checksums are
enabled). The varying proc numbers make that harder than it'd have to be...


> +PMChild *
> +AssignPostmasterChildSlot(BackendType btype)
> +{
> +    dlist_head *freelist;
> +    PMChild    *pmchild;
> +
> +    freelist = GetFreeList(btype);
> +
> +    if (dlist_is_empty(freelist))
> +        return NULL;
> +
> +    pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist));
> +    pmchild->pid = 0;
> +    pmchild->bkend_type = btype;
> +    pmchild->rw = NULL;
> +    pmchild->bgworker_notify = true;
> +
> +    /*
> +     * pmchild->child_slot for each entry was initialized when the array of
> +     * slots was allocated.
> +     */
> +
> +    dlist_push_head(&ActiveChildList, &pmchild->elem);
> +
> +    ReservePostmasterChildSlot(pmchild->child_slot);
> +
> +    /* FIXME: find a more elegant way to pass this */
> +    MyPMChildSlot = pmchild->child_slot;

What if we assigned one offset for each process and assigned its ID here and
also used that for its ProcNumber - that way we wouldn't need to manage
freelists in two places.


> +PMChild *
> +FindPostmasterChildByPid(int pid)
> +{
> +    dlist_iter    iter;
> +
> +    dlist_foreach(iter, &ActiveChildList)
> +    {
> +        PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
> +
> +        if (bp->pid == pid)
> +            return bp;
> +    }
> +    return NULL;
> +}

It's not new, but it's quite sad that postmaster's process exit handling is
effectively O(Backends^2)...



> @@ -1019,7 +980,15 @@ PostmasterMain(int argc, char *argv[])
>      /*
>       * If enabled, start up syslogger collection subprocess
>       */
> -    SysLoggerPID = SysLogger_Start();
> +    SysLoggerPMChild = AssignPostmasterChildSlot(B_LOGGER);
> +    if (!SysLoggerPMChild)
> +        elog(ERROR, "no postmaster child slot available for syslogger");
> +    SysLoggerPMChild->pid = SysLogger_Start();
> +    if (SysLoggerPMChild->pid == 0)
> +    {
> +        FreePostmasterChildSlot(SysLoggerPMChild);
> +        SysLoggerPMChild = NULL;
> +    }

Maybe it's a bit obsessive, but this seems long enough to make it worth not
doing inline in the already long PostmasterMain().


>      /*
>       * We're ready to rock and roll...
>       */
> -    StartupPID = StartChildProcess(B_STARTUP);
> -    Assert(StartupPID != 0);
> +    StartupPMChild = StartChildProcess(B_STARTUP);
> +    Assert(StartupPMChild != NULL);

This (not new) assertion is ... odd.


> @@ -1779,21 +1748,6 @@ canAcceptConnections(int backend_type)
>      if (!connsAllowed && backend_type == B_BACKEND)
>          return CAC_SHUTDOWN;    /* shutdown is pending */
>  
> -    /*
> -     * Don't start too many children.
> -     *
> -     * We allow more connections here than we can have backends because some
> -     * might still be authenticating; they might fail auth, or some existing
> -     * backend might exit before the auth cycle is completed.  The exact
> -     * MaxBackends limit is enforced when a new backend tries to join the
> -     * shared-inval backend array.
> -     *
> -     * The limit here must match the sizes of the per-child-process arrays;
> -     * see comments for MaxLivePostmasterChildren().
> -     */
> -    if (CountChildren(BACKEND_TYPE_ALL & ~(1 << B_DEAD_END_BACKEND)) >= MaxLivePostmasterChildren())
> -        result = CAC_TOOMANY;
> -
>      return result;
>  }

It's nice to get rid of this source of O(N^2).


> @@ -1961,26 +1915,6 @@ process_pm_reload_request(void)
>                  (errmsg("received SIGHUP, reloading configuration files")));
>          ProcessConfigFile(PGC_SIGHUP);
>          SignalSomeChildren(SIGHUP, BACKEND_TYPE_ALL & ~(1 << B_DEAD_END_BACKEND));
> -        if (StartupPID != 0)
> -            signal_child(StartupPID, SIGHUP);
> -        if (BgWriterPID != 0)
> -            signal_child(BgWriterPID, SIGHUP);
> -        if (CheckpointerPID != 0)
> -            signal_child(CheckpointerPID, SIGHUP);
> -        if (WalWriterPID != 0)
> -            signal_child(WalWriterPID, SIGHUP);
> -        if (WalReceiverPID != 0)
> -            signal_child(WalReceiverPID, SIGHUP);
> -        if (WalSummarizerPID != 0)
> -            signal_child(WalSummarizerPID, SIGHUP);
> -        if (AutoVacPID != 0)
> -            signal_child(AutoVacPID, SIGHUP);
> -        if (PgArchPID != 0)
> -            signal_child(PgArchPID, SIGHUP);
> -        if (SysLoggerPID != 0)
> -            signal_child(SysLoggerPID, SIGHUP);
> -        if (SlotSyncWorkerPID != 0)
> -            signal_child(SlotSyncWorkerPID, SIGHUP);
>  
>          /* Reload authentication config files too */
>          if (!load_hba())

For a moment I wondered why this change was part of this commit - but I guess
we didn't have any of these in an array/list before this change...



> @@ -2469,11 +2410,15 @@ process_pm_child_exit(void)
>          }
>  
>          /* Was it the system logger?  If so, try to start a new one */
> -        if (pid == SysLoggerPID)
> +        if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
>          {
> -            SysLoggerPID = 0;
>              /* for safety's sake, launch new logger *first* */
> -            SysLoggerPID = SysLogger_Start();
> +            SysLoggerPMChild->pid = SysLogger_Start();
> +            if (SysLoggerPMChild->pid == 0)
> +            {
> +                FreePostmasterChildSlot(SysLoggerPMChild);
> +                SysLoggerPMChild = NULL;
> +            }
>              if (!EXIT_STATUS_0(exitstatus))
>                  LogChildExit(LOG, _("system logger process"),

Seems a bit weird to have one place with a different memory lifetime handling
than other places. Why don't we just do this the same way as in other places
but continue to defer the logging until after we tried to start the new
logger?

Might be worth having a test ensuring that loggers restart OK.


>      /* Construct a process name for log message */
> +
> +    /*
> +     * FIXME: use GetBackendTypeDesc here? How does the localization of that
> +     * work?
> +     */
>      if (bp->bkend_type == B_DEAD_END_BACKEND)
>      {
>          procname = _("dead end backend");

Might be worth having a version of GetBackendTypeDesc() that returns a
translated string?


> @@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
>      {
>          dlist_iter    iter;
>  
> -        dlist_foreach(iter, &BackendList)
> +        dlist_foreach(iter, &ActiveChildList)
>          {
> -            Backend    *bp = dlist_container(Backend, elem, iter.cur);
> +            PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
> +
> +            /* We do NOT restart the syslogger */
> +            if (bp == SysLoggerPMChild)
> +                continue;

That comment seems a bit misleading - we do restart syslogger, we just don't
do it here, no?  I realize it's an old comment, but it still seems like it's
worth fixing given that you touch all the code here...


> @@ -2708,48 +2661,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
>               * We could exclude dead_end children here, but at least when
>               * sending SIGABRT it seems better to include them.
>               */
> -            sigquit_child(bp->pid);
> +            sigquit_child(bp);
>          }
> -
> -        if (StartupPID != 0)
> -        {
> -            sigquit_child(StartupPID);
> -            StartupStatus = STARTUP_SIGNALED;
> -        }
> -
> -        /* Take care of the bgwriter too */
> -        if (BgWriterPID != 0)
> -            sigquit_child(BgWriterPID);
> -
> -        /* Take care of the checkpointer too */
> -        if (CheckpointerPID != 0)
> -            sigquit_child(CheckpointerPID);
> -
> -        /* Take care of the walwriter too */
> -        if (WalWriterPID != 0)
> -            sigquit_child(WalWriterPID);
> -
> -        /* Take care of the walreceiver too */
> -        if (WalReceiverPID != 0)
> -            sigquit_child(WalReceiverPID);
> -
> -        /* Take care of the walsummarizer too */
> -        if (WalSummarizerPID != 0)
> -            sigquit_child(WalSummarizerPID);
> -
> -        /* Take care of the autovacuum launcher too */
> -        if (AutoVacPID != 0)
> -            sigquit_child(AutoVacPID);
> -
> -        /* Take care of the archiver too */
> -        if (PgArchPID != 0)
> -            sigquit_child(PgArchPID);
> -
> -        /* Take care of the slot sync worker too */
> -        if (SlotSyncWorkerPID != 0)
> -            sigquit_child(SlotSyncWorkerPID);
> -
> -        /* We do NOT restart the syslogger */
>      }

Yay.




> @@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
> <snip>
> -        if (WalSummarizerPID != 0)
> -            signal_child(WalSummarizerPID, SIGTERM);
> -        if (SlotSyncWorkerPID != 0)
> -            signal_child(SlotSyncWorkerPID, SIGTERM);
> +        targetMask |= (1 << B_STARTUP);
> +        targetMask |= (1 << B_WAL_RECEIVER);
> +
> +        targetMask |= (1 << B_WAL_SUMMARIZER);
> +        targetMask |= (1 << B_SLOTSYNC_WORKER);
>          /* checkpointer, archiver, stats, and syslogger may continue for now */
>  
> +        SignalSomeChildren(SIGTERM, targetMask);
> +
>          /* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
>          pmState = PM_WAIT_BACKENDS;
> <snip>

It's likely the right thing to not do as one patch, but IMO this really wants
to be a state table. Perhaps as part of child_process_kinds, perhaps separate
from that.


> @@ -3130,8 +3047,21 @@ static void
>  LaunchMissingBackgroundProcesses(void)
>  {
>      /* Syslogger is active in all states */
> -    if (SysLoggerPID == 0 && Logging_collector)
> -        SysLoggerPID = SysLogger_Start();
> +    if (SysLoggerPMChild == NULL && Logging_collector)
> +    {
> +        SysLoggerPMChild = AssignPostmasterChildSlot(B_LOGGER);
> +        if (!SysLoggerPMChild)
> +            elog(LOG, "no postmaster child slot available for syslogger");

How could this elog() be reached? Seems something seriously would have gone
wrong to get here - in which case a LOG that might not even be visible (due to
logger not working) doesn't seem like the right response.

> @@ -3334,29 +3270,12 @@ SignalSomeChildren(int signal, uint32 targetMask)
>  static void
>  TerminateChildren(int signal)
>  {

The comment for TerminateChildren() says "except syslogger and dead_end
backends." - aren't you including the latter here?




> @@ -311,14 +311,9 @@ InitProcess(void)
>      /*
>       * Before we start accessing the shared memory in a serious way, mark
>       * ourselves as an active postmaster child; this is so that the postmaster
> -     * can detect it if we exit without cleaning up.  (XXX autovac launcher
> -     * currently doesn't participate in this; it probably should.)
> -     *
> -     * Slot sync worker also does not participate in it, see comments atop
> -     * 'struct bkend' in postmaster.c.
> +     * can detect it if we exit without cleaning up.
>       */
> -    if (IsUnderPostmaster && !AmAutoVacuumLauncherProcess() &&
> -        !AmLogicalSlotSyncWorkerProcess())
> +    if (IsUnderPostmaster)
>          MarkPostmasterChildActive();
>  
>      /* Decide which list should supply our PGPROC. */
> @@ -536,6 +531,9 @@ InitAuxiliaryProcess(void)
>      if (MyProc != NULL)
>          elog(ERROR, "you already exist");
>  
> +    if (IsUnderPostmaster)
> +        MarkPostmasterChildActive();
> +
>      /*
>       * We use the ProcStructLock to protect assignment and releasing of
>       * AuxiliaryProcs entries.

Probably worth, at some point soon, to have an InitProcessCommon() or such.


Greetings,

Andres Freund



Re: Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
On 04/09/2024 17:35, Andres Freund wrote:
> On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:
>> +Running the tests
>> +=================
>> +
>> +NOTE: You must have given the --enable-tap-tests argument to configure.
>> +
>> +Run
>> +    make check
>> +or
>> +    make installcheck
>> +You can use "make installcheck" if you previously did "make install".
>> +In that case, the code in the installation tree is tested.  With
>> +"make check", a temporary installation tree is built from the current
>> +sources and then tested.
>> +
>> +Either way, this test initializes, starts, and stops a test Postgres
>> +cluster.
>> +
>> +See src/test/perl/README for more info about running these tests.
> 
> Is it really useful to have such instructions all over the tree?

That's debatable but I didn't want to go down that rabbit hole with this 
patch.

It's repetitive for sure. But there are small variations in which 
PG_TEST_EXTRA options you need, whether "make installcheck" runs against 
a running server or still creates a temporary cluster, etc.

I tried to deduplicate those instructions by moving the above 
boilerplate to src/test/README, and only noting the variations in the 
subdirectory READMEs. I didn't like the result. It's very helpful to 
have full copy-pasteable commands with all the right "PG_TEST_EXTRA" 
options for each test.

These instructions also don't mention how to run the tests with Meson. 
The first time I wanted to run individual tests with Meson, it took me a 
while to figure it out.

I'll think a little more about how to improve these READMEs, but let's 
take that to a separate thread.

>>  From 93b9e9b6e072f63af9009e0d66ab6d0d62ea8c15 Mon Sep 17 00:00:00 2001
>> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
>> Date: Mon, 12 Aug 2024 10:55:11 +0300
>> Subject: [PATCH v4 2/8] Add test for dead-end backends
>>
>> The code path for launching a dead-end backend because we're out of
>> slots was not covered by any tests, so add one. (Some tests did hit
>> the case of launching a dead-end backend because the server is still
>> starting up, though, so the gap in our test coverage wasn't as big as
>> it sounds.)
>> ---
>>   src/test/perl/PostgreSQL/Test/Cluster.pm      | 39 +++++++++++++++++++
>>   .../postmaster/t/001_connection_limits.pl     | 17 +++++++-
>>   2 files changed, 55 insertions(+), 1 deletion(-)
> 
> Why does this need to use "raw" connections?  Can't you just create a bunch of
> connections with BackgroundPsql?

No, these need to be connections that haven't sent the startup packet 
the yet.

With Andrew's PqFFI work [1], we could do better. The latest version on 
that thread doesn't expose the async functions like PQconnectStart() 
PQconnectPoll() though, but they can be added.

[1] 
https://www.postgresql.org/message-id/97d1d1b9-d147-f69d-1991-d8794efed41c%40dunslane.net


Unless you have comments on these first two patches which just add 
tests, I'll commit them shortly. Still processing the rest of your 
comments...

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
On 04/09/2024 17:35, Andres Freund wrote:
> On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:
>>  From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001
>> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
>> Date: Mon, 12 Aug 2024 10:59:04 +0300
>> Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children
>>
>> And replace postmaster.c's own "backend type" codes with BackendType
> 
> Hm - it seems a bit odd to open-code this when we actually have a "table
> driven configuration" available?  Why isn't the type a field in
> child_process_kind?

Sorry, I didn't understand this. What exactly would you add to 
child_process_kind? Where would you use it?

>> +/*
>> + * MaxLivePostmasterChildren
>> + *
>> + * This reports the number postmaster child processes that can be active.  It
>> + * includes all children except for dead_end children.  This allows the array
>> + * in shared memory (PMChildFlags) to have a fixed maximum size.
>> + */
>> +int
>> +MaxLivePostmasterChildren(void)
>> +{
>> +    int            n = 0;
>> +
>> +    /* We know exactly how mamy worker and aux processes can be active */
>> +    n += autovacuum_max_workers;
>> +    n += max_worker_processes;
>> +    n += NUM_AUXILIARY_PROCS;
>> +
>> +    /*
>> +     * We allow more connections here than we can have backends because some
>> +     * might still be authenticating; they might fail auth, or some existing
>> +     * backend might exit before the auth cycle is completed.  The exact
>> +     * MaxBackends limit is enforced when a new backend tries to join the
>> +     * shared-inval backend array.
>> +     */
>> +    n += 2 * (MaxConnections + max_wal_senders);
>> +
>> +    return n;
>> +}
> 
> I wonder if we could instead maintain at least some of this in
> child_process_kinds? Manually listing different types of processes in
> different places doesn't seem particularly sustainable.

Hmm, you mean adding "max this kind of children" field to 
child_process_kinds? Perhaps.

> 
>> +/*
>> + * Initialize at postmaster startup
>> + */
>> +void
>> +InitPostmasterChildSlots(void)
>> +{
>> +    int            num_pmchild_slots;
>> +    int            slotno;
>> +    PMChild    *slots;
>> +
>> +    dlist_init(&freeBackendList);
>> +    dlist_init(&freeAutoVacWorkerList);
>> +    dlist_init(&freeBgWorkerList);
>> +    dlist_init(&freeAuxList);
>> +    dlist_init(&ActiveChildList);
>> +
>> +    num_pmchild_slots = MaxLivePostmasterChildren();
>> +
>> +    slots = palloc(num_pmchild_slots * sizeof(PMChild));
>> +
>> +    slotno = 0;
>> +    for (int i = 0; i < 2 * (MaxConnections + max_wal_senders); i++)
>> +    {
>> +        init_slot(&slots[slotno], slotno, &freeBackendList);
>> +        slotno++;
>> +    }
>> +    for (int i = 0; i < autovacuum_max_workers; i++)
>> +    {
>> +        init_slot(&slots[slotno], slotno, &freeAutoVacWorkerList);
>> +        slotno++;
>> +    }
>> +    for (int i = 0; i < max_worker_processes; i++)
>> +    {
>> +        init_slot(&slots[slotno], slotno, &freeBgWorkerList);
>> +        slotno++;
>> +    }
>> +    for (int i = 0; i < NUM_AUXILIARY_PROCS; i++)
>> +    {
>> +        init_slot(&slots[slotno], slotno, &freeAuxList);
>> +        slotno++;
>> +    }
>> +    Assert(slotno == num_pmchild_slots);
>> +}
> 
> Along the same vein - could we generalize this into one array of different
> slot types and then loop over that to initialize / acquire the slots?

Makes sense.

>> +/* Return the appropriate free-list for the given backend type */
>> +static dlist_head *
>> +GetFreeList(BackendType btype)
>> +{
>> +    switch (btype)
>> +    {
>> +        case B_BACKEND:
>> +        case B_BG_WORKER:
>> +        case B_WAL_SENDER:
>> +        case B_SLOTSYNC_WORKER:
>> +            return &freeBackendList;
> 
> Maybe a daft question - but why are all of these in the same list? Sure,
> they're essentially all full backends, but they're covered by different GUCs?

No reason. No particular reason they should *not* share the same list 
either though.

> 
>> +            /*
>> +             * Auxiliary processes.  There can be only one of each of these
>> +             * running at a time.
>> +             */
>> +        case B_AUTOVAC_LAUNCHER:
>> +        case B_ARCHIVER:
>> +        case B_BG_WRITER:
>> +        case B_CHECKPOINTER:
>> +        case B_STARTUP:
>> +        case B_WAL_RECEIVER:
>> +        case B_WAL_SUMMARIZER:
>> +        case B_WAL_WRITER:
>> +            return &freeAuxList;
>> +
>> +            /*
>> +             * Logger is not connected to shared memory, and does not have a
>> +             * PGPROC entry, but we still allocate a child slot for it.
>> +             */
> 
> Tangential: Why do we need a freelist for these and why do we choose a random
> pgproc for these instead of assigning one statically?
> 
> Background: I'd like to not provide AIO workers with "bounce buffers" (for IO
> of buffers that can't be done in-place, like writes when checksums are
> enabled). The varying proc numbers make that harder than it'd have to be...

Yeah, we can make these fixed.Currently, the # of slots reserved for aux 
processes is sized by NUM_AUXILIARY_PROCS, which is one smaller than the 
number of different aux proces kinds:

> /*
>  * We set aside some extra PGPROC structures for auxiliary processes,
>  * ie things that aren't full-fledged backends but need shmem access.
>  *
>  * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
>  * run during normal operation.  Startup process and WAL receiver also consume
>  * 2 slots, but WAL writer is launched only after startup has exited, so we
>  * only need 6 slots.
>  */
> #define NUM_AUXILIARY_PROCS        6

For PMChildSlot numbers, we could certainly just allocate one more slot.

It would probably make sense for PGPROCs too, even though PGPROC is a 
much larger struct.

>> +PMChild *
>> +AssignPostmasterChildSlot(BackendType btype)
>> +{
>> +    dlist_head *freelist;
>> +    PMChild    *pmchild;
>> +
>> +    freelist = GetFreeList(btype);
>> +
>> +    if (dlist_is_empty(freelist))
>> +        return NULL;
>> +
>> +    pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist));
>> +    pmchild->pid = 0;
>> +    pmchild->bkend_type = btype;
>> +    pmchild->rw = NULL;
>> +    pmchild->bgworker_notify = true;
>> +
>> +    /*
>> +     * pmchild->child_slot for each entry was initialized when the array of
>> +     * slots was allocated.
>> +     */
>> +
>> +    dlist_push_head(&ActiveChildList, &pmchild->elem);
>> +
>> +    ReservePostmasterChildSlot(pmchild->child_slot);
>> +
>> +    /* FIXME: find a more elegant way to pass this */
>> +    MyPMChildSlot = pmchild->child_slot;
> 
> What if we assigned one offset for each process and assigned its ID here and
> also used that for its ProcNumber - that way we wouldn't need to manage
> freelists in two places.

It's currently possible to have up to 2 * max_connections backends in 
the authentication phase. We would have to change that behaviour, or 
make the PGPROC array 2x larger.

It might well be worth it, I don't know how sensible the current 
behaviour is. But I'd like to punt that to later patch, to keep the 
scope of this patch set reasonable. It's pretty straightforward to do 
later on top of this if we want to.

>> +PMChild *
>> +FindPostmasterChildByPid(int pid)
>> +{
>> +    dlist_iter    iter;
>> +
>> +    dlist_foreach(iter, &ActiveChildList)
>> +    {
>> +        PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
>> +
>> +        if (bp->pid == pid)
>> +            return bp;
>> +    }
>> +    return NULL;
>> +}
> 
> It's not new, but it's quite sad that postmaster's process exit handling is
> effectively O(Backends^2)...

It would be straightforward to turn ActiveChildList into a hash table. 
But I'd like to leave that to a followup patch too.

>>       /*
>>        * We're ready to rock and roll...
>>        */
>> -    StartupPID = StartChildProcess(B_STARTUP);
>> -    Assert(StartupPID != 0);
>> +    StartupPMChild = StartChildProcess(B_STARTUP);
>> +    Assert(StartupPMChild != NULL);
> 
> This (not new) assertion is ... odd.

Yeah, it's an assertion because StartChildProcess has this:

>         /*
>          * fork failure is fatal during startup, but there's no need to choke
>          * immediately if starting other child types fails.
>          */
>         if (type == B_STARTUP)
>             ExitPostmaster(1);


>> @@ -1961,26 +1915,6 @@ process_pm_reload_request(void)
>>                   (errmsg("received SIGHUP, reloading configuration files")));
>>           ProcessConfigFile(PGC_SIGHUP);
>>           SignalSomeChildren(SIGHUP, BACKEND_TYPE_ALL & ~(1 << B_DEAD_END_BACKEND));
>> -        if (StartupPID != 0)
>> -            signal_child(StartupPID, SIGHUP);
>> -        if (BgWriterPID != 0)
>> -            signal_child(BgWriterPID, SIGHUP);
>> -        if (CheckpointerPID != 0)
>> -            signal_child(CheckpointerPID, SIGHUP);
>> -        if (WalWriterPID != 0)
>> -            signal_child(WalWriterPID, SIGHUP);
>> -        if (WalReceiverPID != 0)
>> -            signal_child(WalReceiverPID, SIGHUP);
>> -        if (WalSummarizerPID != 0)
>> -            signal_child(WalSummarizerPID, SIGHUP);
>> -        if (AutoVacPID != 0)
>> -            signal_child(AutoVacPID, SIGHUP);
>> -        if (PgArchPID != 0)
>> -            signal_child(PgArchPID, SIGHUP);
>> -        if (SysLoggerPID != 0)
>> -            signal_child(SysLoggerPID, SIGHUP);
>> -        if (SlotSyncWorkerPID != 0)
>> -            signal_child(SlotSyncWorkerPID, SIGHUP);
>>   
>>           /* Reload authentication config files too */
>>           if (!load_hba())
> 
> For a moment I wondered why this change was part of this commit - but I guess
> we didn't have any of these in an array/list before this change...

Correct.

>> @@ -2469,11 +2410,15 @@ process_pm_child_exit(void)
>>           }
>>   
>>           /* Was it the system logger?  If so, try to start a new one */
>> -        if (pid == SysLoggerPID)
>> +        if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
>>           {
>> -            SysLoggerPID = 0;
>>               /* for safety's sake, launch new logger *first* */
>> -            SysLoggerPID = SysLogger_Start();
>> +            SysLoggerPMChild->pid = SysLogger_Start();
>> +            if (SysLoggerPMChild->pid == 0)
>> +            {
>> +                FreePostmasterChildSlot(SysLoggerPMChild);
>> +                SysLoggerPMChild = NULL;
>> +            }
>>               if (!EXIT_STATUS_0(exitstatus))
>>                   LogChildExit(LOG, _("system logger process"),
> 
> Seems a bit weird to have one place with a different memory lifetime handling
> than other places. Why don't we just do this the same way as in other places
> but continue to defer the logging until after we tried to start the new
> logger?

Hmm, you mean let LaunchMissingBackgroundProcesses() handle the restart?

I'm a little scared of changing the existing logic. We don't have a 
mechanism for deferring logging, so we would have to invent that, or the 
logs would just accumulate in the pipe until syslogger starts up. 
There's some code between here and LaunchMissingBackgroundProcesses(), 
so might postmaster get blocked between writing to the syslogger pipe, 
before having restarted it?

If forking the syslogger process fails, that can happen anyway, though.

> Might be worth having a test ensuring that loggers restart OK.

Yeah..

>>       /* Construct a process name for log message */
>> +
>> +    /*
>> +     * FIXME: use GetBackendTypeDesc here? How does the localization of that
>> +     * work?
>> +     */
>>       if (bp->bkend_type == B_DEAD_END_BACKEND)
>>       {
>>           procname = _("dead end backend");
> 
> Might be worth having a version of GetBackendTypeDesc() that returns a
> translated string?

Constructing the string for background workers is a little more complicated:

     snprintf(namebuf, MAXPGPATH, _("background worker \"%s\""),
              bp->rw->rw_worker.bgw_type);

We could still do that for background workers and use the transalated 
variant of GetBackendTypeDesc() for everything else though.

>> @@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
>>       {
>>           dlist_iter    iter;
>>   
>> -        dlist_foreach(iter, &BackendList)
>> +        dlist_foreach(iter, &ActiveChildList)
>>           {
>> -            Backend    *bp = dlist_container(Backend, elem, iter.cur);
>> +            PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
>> +
>> +            /* We do NOT restart the syslogger */
>> +            if (bp == SysLoggerPMChild)
>> +                continue;
> 
> That comment seems a bit misleading - we do restart syslogger, we just don't
> do it here, no?  I realize it's an old comment, but it still seems like it's
> worth fixing given that you touch all the code here...

No, we really do not restart the syslogger. This code runs when 
*another* process has crashed unexpectedly. We kill all other processes, 
reinitialize shared memory and restart, but the old syslogger keeps 
running through all that.

I'll add a note on that to InitPostmasterChildSlots(), as it's a bit 
surprising.

>> @@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
>> <snip>
>> -        if (WalSummarizerPID != 0)
>> -            signal_child(WalSummarizerPID, SIGTERM);
>> -        if (SlotSyncWorkerPID != 0)
>> -            signal_child(SlotSyncWorkerPID, SIGTERM);
>> +        targetMask |= (1 << B_STARTUP);
>> +        targetMask |= (1 << B_WAL_RECEIVER);
>> +
>> +        targetMask |= (1 << B_WAL_SUMMARIZER);
>> +        targetMask |= (1 << B_SLOTSYNC_WORKER);
>>           /* checkpointer, archiver, stats, and syslogger may continue for now */
>>   
>> +        SignalSomeChildren(SIGTERM, targetMask);
>> +
>>           /* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
>>           pmState = PM_WAIT_BACKENDS;
>> <snip>
> 
> It's likely the right thing to not do as one patch, but IMO this really wants
> to be a state table. Perhaps as part of child_process_kinds, perhaps separate
> from that.

Yeah. I've tried to refactor this into a table before, but didn't come 
up with anything that I was happy with. I also feel there must be a 
better way to organize this, but not sure what exactly. I hope that will 
become more apparent after these other changes.

>> @@ -3130,8 +3047,21 @@ static void
>>   LaunchMissingBackgroundProcesses(void)
>>   {
>>       /* Syslogger is active in all states */
>> -    if (SysLoggerPID == 0 && Logging_collector)
>> -        SysLoggerPID = SysLogger_Start();
>> +    if (SysLoggerPMChild == NULL && Logging_collector)
>> +    {
>> +        SysLoggerPMChild = AssignPostmasterChildSlot(B_LOGGER);
>> +        if (!SysLoggerPMChild)
>> +            elog(LOG, "no postmaster child slot available for syslogger");
> 
> How could this elog() be reached? Seems something seriously would have gone
> wrong to get here - in which case a LOG that might not even be visible (due to
> logger not working) doesn't seem like the right response.

I'll turn it into an assertion or PANIC.

>> @@ -3334,29 +3270,12 @@ SignalSomeChildren(int signal, uint32 targetMask)
>>   static void
>>   TerminateChildren(int signal)
>>   {
> 
> The comment for TerminateChildren() says "except syslogger and dead_end
> backends." - aren't you including the latter here?

The comment is adjusted in 
v4-0004-Introduce-a-separate-BackendType-for-dead-end-chi.patch. Before 
that, SignalChildren() does ignore dead-end children.

Thanks for the review!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Refactoring postmaster's code to cleanup after child exit

From
Robert Haas
Date:
On Fri, Sep 6, 2024 at 9:13 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> It's currently possible to have up to 2 * max_connections backends in
> the authentication phase. We would have to change that behaviour, or
> make the PGPROC array 2x larger.

I know I already said this elsewhere, but in case it got lost in the
shuffle, +1 for changing this, unless somebody can make a compelling
argument why 2 * max_connections isn't WAY too many.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Refactoring postmaster's code to cleanup after child exit

From
Andres Freund
Date:
Hi,

On 2024-09-06 16:13:43 +0300, Heikki Linnakangas wrote:
> On 04/09/2024 17:35, Andres Freund wrote:
> > On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:
> > >  From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001
> > > From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> > > Date: Mon, 12 Aug 2024 10:59:04 +0300
> > > Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children
> > >
> > > And replace postmaster.c's own "backend type" codes with BackendType
> >
> > Hm - it seems a bit odd to open-code this when we actually have a "table
> > driven configuration" available?  Why isn't the type a field in
> > child_process_kind?
>
> Sorry, I didn't understand this. What exactly would you add to
> child_process_kind? Where would you use it?

I'm not entirely sure what I was thinking of. It might be partially triggering
a prior complaint I had about manually assigning things to MyBackendType,
despite actually having all the information already.

One thing that I just noticed is that this patch orphans comment references to
BACKEND_TYPE_AUTOVAC and BACKEND_TYPE_BGWORKER.

Seems a tad odd to have BACKEND_TYPE_ALL after removing everything else from
the BACKEND_TYPE_* "namespace".


To deal with the issue around bitmasks you had mentioned, I think we should at
least have a static inline function to convert B_* values to the bitmask
index.


> > > +/*
> > > + * MaxLivePostmasterChildren
> > > + *
> > > + * This reports the number postmaster child processes that can be active.  It
> > > + * includes all children except for dead_end children.  This allows the array
> > > + * in shared memory (PMChildFlags) to have a fixed maximum size.
> > > + */
> > > +int
> > > +MaxLivePostmasterChildren(void)
> > > +{
> > > +    int            n = 0;
> > > +
> > > +    /* We know exactly how mamy worker and aux processes can be active */
> > > +    n += autovacuum_max_workers;
> > > +    n += max_worker_processes;
> > > +    n += NUM_AUXILIARY_PROCS;
> > > +
> > > +    /*
> > > +     * We allow more connections here than we can have backends because some
> > > +     * might still be authenticating; they might fail auth, or some existing
> > > +     * backend might exit before the auth cycle is completed.  The exact
> > > +     * MaxBackends limit is enforced when a new backend tries to join the
> > > +     * shared-inval backend array.
> > > +     */
> > > +    n += 2 * (MaxConnections + max_wal_senders);
> > > +
> > > +    return n;
> > > +}
> >
> > I wonder if we could instead maintain at least some of this in
> > child_process_kinds? Manually listing different types of processes in
> > different places doesn't seem particularly sustainable.
>
> Hmm, you mean adding "max this kind of children" field to
> child_process_kinds? Perhaps.

Yep, that's what I meant.



> > > +/* Return the appropriate free-list for the given backend type */
> > > +static dlist_head *
> > > +GetFreeList(BackendType btype)
> > > +{
> > > +    switch (btype)
> > > +    {
> > > +        case B_BACKEND:
> > > +        case B_BG_WORKER:
> > > +        case B_WAL_SENDER:
> > > +        case B_SLOTSYNC_WORKER:
> > > +            return &freeBackendList;
> >
> > Maybe a daft question - but why are all of these in the same list? Sure,
> > they're essentially all full backends, but they're covered by different GUCs?
>
> No reason. No particular reason they should *not* share the same list either
> though.

Aren't they controlled by distinct connection limits? Isn't there a danger
that we could use up entries and fail connections due to that, despite not
actually being above the limit?


> > Tangential: Why do we need a freelist for these and why do we choose a random
> > pgproc for these instead of assigning one statically?
> >
> > Background: I'd like to not provide AIO workers with "bounce buffers" (for IO
> > of buffers that can't be done in-place, like writes when checksums are
> > enabled). The varying proc numbers make that harder than it'd have to be...
>
> Yeah, we can make these fixed.

Cool.


> Currently, the # of slots reserved for aux processes is sized by
> NUM_AUXILIARY_PROCS, which is one smaller than the number of different aux
> proces kinds:

> > /*
> >  * We set aside some extra PGPROC structures for auxiliary processes,
> >  * ie things that aren't full-fledged backends but need shmem access.
> >  *
> >  * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
> >  * run during normal operation.  Startup process and WAL receiver also consume
> >  * 2 slots, but WAL writer is launched only after startup has exited, so we
> >  * only need 6 slots.
> >  */
> > #define NUM_AUXILIARY_PROCS        6
>
> For PMChildSlot numbers, we could certainly just allocate one more slot.
>
> It would probably make sense for PGPROCs too, even though PGPROC is a much
> larger struct.

I don't think it's worth worrying about that much. PGPROC is large, but not
*that* large. And the robustness win of properly detecting when there's a
problem around starting/stopping aux workers seems to outweigh that to me.



> > > +PMChild *
> > > +AssignPostmasterChildSlot(BackendType btype)
> > > +{
> > > +    dlist_head *freelist;
> > > +    PMChild    *pmchild;
> > > +
> > > +    freelist = GetFreeList(btype);
> > > +
> > > +    if (dlist_is_empty(freelist))
> > > +        return NULL;
> > > +
> > > +    pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist));
> > > +    pmchild->pid = 0;
> > > +    pmchild->bkend_type = btype;
> > > +    pmchild->rw = NULL;
> > > +    pmchild->bgworker_notify = true;
> > > +
> > > +    /*
> > > +     * pmchild->child_slot for each entry was initialized when the array of
> > > +     * slots was allocated.
> > > +     */
> > > +
> > > +    dlist_push_head(&ActiveChildList, &pmchild->elem);
> > > +
> > > +    ReservePostmasterChildSlot(pmchild->child_slot);
> > > +
> > > +    /* FIXME: find a more elegant way to pass this */
> > > +    MyPMChildSlot = pmchild->child_slot;
> >
> > What if we assigned one offset for each process and assigned its ID here and
> > also used that for its ProcNumber - that way we wouldn't need to manage
> > freelists in two places.
>
> It's currently possible to have up to 2 * max_connections backends in the
> authentication phase. We would have to change that behaviour, or make the
> PGPROC array 2x larger.

That however, might be too much...


> It might well be worth it, I don't know how sensible the current behaviour
> is. But I'd like to punt that to later patch, to keep the scope of this
> patch set reasonable. It's pretty straightforward to do later on top of this
> if we want to.

Makes sense.


I still think that we'd be better off to just return an error to the client in
postmaster, rather than deal with this dead-end children mess. That was
perhaps justified at some point, but now it seems to add way more complexity
than it's worth. And it's absurdly expensive to fork to return an error. Way
more expensive than just having postmaster send an error and close the socket.


> > > @@ -2469,11 +2410,15 @@ process_pm_child_exit(void)
> > >           }
> > >           /* Was it the system logger?  If so, try to start a new one */
> > > -        if (pid == SysLoggerPID)
> > > +        if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
> > >           {
> > > -            SysLoggerPID = 0;
> > >               /* for safety's sake, launch new logger *first* */
> > > -            SysLoggerPID = SysLogger_Start();
> > > +            SysLoggerPMChild->pid = SysLogger_Start();
> > > +            if (SysLoggerPMChild->pid == 0)
> > > +            {
> > > +                FreePostmasterChildSlot(SysLoggerPMChild);
> > > +                SysLoggerPMChild = NULL;
> > > +            }
> > >               if (!EXIT_STATUS_0(exitstatus))
> > >                   LogChildExit(LOG, _("system logger process"),
> >
> > Seems a bit weird to have one place with a different memory lifetime handling
> > than other places. Why don't we just do this the same way as in other places
> > but continue to defer the logging until after we tried to start the new
> > logger?
>
> Hmm, you mean let LaunchMissingBackgroundProcesses() handle the restart?

Yea - which it already can do, presumably to handle the case of
logging_collector. It just seems odd to have code to have three places calling
SysLogger_Start() - with some mild variations of the code.

Perhaps we can at least centralize some of that?


But you have a point with:

> I'm a little scared of changing the existing logic. We don't have a
> mechanism for deferring logging, so we would have to invent that, or the
> logs would just accumulate in the pipe until syslogger starts up. There's
> some code between here and LaunchMissingBackgroundProcesses(), so might
> postmaster get blocked between writing to the syslogger pipe, before having
> restarted it?
>
> If forking the syslogger process fails, that can happen anyway, though.




> > >       /* Construct a process name for log message */
> > > +
> > > +    /*
> > > +     * FIXME: use GetBackendTypeDesc here? How does the localization of that
> > > +     * work?
> > > +     */
> > >       if (bp->bkend_type == B_DEAD_END_BACKEND)
> > >       {
> > >           procname = _("dead end backend");
> >
> > Might be worth having a version of GetBackendTypeDesc() that returns a
> > translated string?
>
> Constructing the string for background workers is a little more complicated:

Random aside: I *hate* that there's no trivial way recognie background workers
in pg_stat_activity, because somebody made pg_stat_activity.backend_type
report something completely under control of extensions...




> > > @@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
> > >       {
> > >           dlist_iter    iter;
> > > -        dlist_foreach(iter, &BackendList)
> > > +        dlist_foreach(iter, &ActiveChildList)
> > >           {
> > > -            Backend    *bp = dlist_container(Backend, elem, iter.cur);
> > > +            PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
> > > +
> > > +            /* We do NOT restart the syslogger */
> > > +            if (bp == SysLoggerPMChild)
> > > +                continue;
> >
> > That comment seems a bit misleading - we do restart syslogger, we just don't
> > do it here, no?  I realize it's an old comment, but it still seems like it's
> > worth fixing given that you touch all the code here...
>
> No, we really do not restart the syslogger.

Hm?

        /* Was it the system logger?  If so, try to start a new one */
        if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
        {
            /* for safety's sake, launch new logger *first* */
            SysLoggerPMChild->pid = SysLogger_Start(SysLoggerPMChild->child_slot);
            if (SysLoggerPMChild->pid == 0)
            {
                FreePostmasterChildSlot(SysLoggerPMChild);
                SysLoggerPMChild = NULL;
            }
            if (!EXIT_STATUS_0(exitstatus))
                LogChildExit(LOG, _("system logger process"),
                             pid, exitstatus);
            continue;
        }

We don't do it reaction to other processes crashing, but we still restart it
if it dies. Perhaps it's clear from context - but I had to think aobut it for
a moment.


>
> > > @@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
> > > <snip>
> > > -        if (WalSummarizerPID != 0)
> > > -            signal_child(WalSummarizerPID, SIGTERM);
> > > -        if (SlotSyncWorkerPID != 0)
> > > -            signal_child(SlotSyncWorkerPID, SIGTERM);
> > > +        targetMask |= (1 << B_STARTUP);
> > > +        targetMask |= (1 << B_WAL_RECEIVER);
> > > +
> > > +        targetMask |= (1 << B_WAL_SUMMARIZER);
> > > +        targetMask |= (1 << B_SLOTSYNC_WORKER);
> > >           /* checkpointer, archiver, stats, and syslogger may continue for now */
> > > +        SignalSomeChildren(SIGTERM, targetMask);
> > > +
> > >           /* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
> > >           pmState = PM_WAIT_BACKENDS;
> > > <snip>
> >
> > It's likely the right thing to not do as one patch, but IMO this really wants
> > to be a state table. Perhaps as part of child_process_kinds, perhaps separate
> > from that.
>
> Yeah. I've tried to refactor this into a table before, but didn't come up
> with anything that I was happy with. I also feel there must be a better way
> to organize this, but not sure what exactly. I hope that will become more
> apparent after these other changes.

What I'm imagining is something like:
1) Make PMState values each have a distinct bit
2) Move PMState to some (new?) header
3) Add a "uint32 should_run" member to child_process_kind that's a bitmask of
   PMStates
4) Add a new function in launch_backend.c that gets passed the "target"
   PMState and returns a bitmask of the tasks that should be running (or the
   inverse, doesn't really matter).
5) Instead of open-coding the targetMask "computation", use the new function
   from 4).

I think that might not look too bad?

Greetings,

Andres Freund



Re: Refactoring postmaster's code to cleanup after child exit

From
Robert Haas
Date:
On Tue, Sep 10, 2024 at 12:59 PM Andres Freund <andres@anarazel.de> wrote:
> I still think that we'd be better off to just return an error to the client in
> postmaster, rather than deal with this dead-end children mess. That was
> perhaps justified at some point, but now it seems to add way more complexity
> than it's worth. And it's absurdly expensive to fork to return an error. Way
> more expensive than just having postmaster send an error and close the socket.

The tricky case is the one where the client write() -- or SSL_write() -- blocks.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Refactoring postmaster's code to cleanup after child exit

From
Andres Freund
Date:
Hi,

On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:
> @@ -2864,6 +2777,8 @@ PostmasterStateMachine(void)
>       */
>      if (pmState == PM_STOP_BACKENDS)
>      {
> +        uint32        targetMask;
> +
>          /*
>           * Forget any pending requests for background workers, since we're no
>           * longer willing to launch any new workers.  (If additional requests
> @@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
>           */
>          ForgetUnstartedBackgroundWorkers();
>  
> -        /* Signal all backend children except walsenders and dead-end backends */
> -        SignalSomeChildren(SIGTERM,
> -                           BACKEND_TYPE_ALL & ~(1 << B_WAL_SENDER | 1 << B_DEAD_END_BACKEND));
> +        /* Signal all backend children except walsenders */
> +        /* dead-end children are not signalled yet */
> +        targetMask = (1 << B_BACKEND);
> +        targetMask |= (1 << B_BG_WORKER);
> +
>          /* and the autovac launcher too */
> -        if (AutoVacPID != 0)
> -            signal_child(AutoVacPID, SIGTERM);
> +        targetMask |= (1 << B_AUTOVAC_LAUNCHER);
>          /* and the bgwriter too */
> -        if (BgWriterPID != 0)
> -            signal_child(BgWriterPID, SIGTERM);
> +        targetMask |= (1 << B_BG_WRITER);
>          /* and the walwriter too */
> -        if (WalWriterPID != 0)
> -            signal_child(WalWriterPID, SIGTERM);
> +        targetMask |= (1 << B_WAL_WRITER);
>          /* If we're in recovery, also stop startup and walreceiver procs */
> -        if (StartupPID != 0)
> -            signal_child(StartupPID, SIGTERM);
> -        if (WalReceiverPID != 0)
> -            signal_child(WalReceiverPID, SIGTERM);
> -        if (WalSummarizerPID != 0)
> -            signal_child(WalSummarizerPID, SIGTERM);
> -        if (SlotSyncWorkerPID != 0)
> -            signal_child(SlotSyncWorkerPID, SIGTERM);
> +        targetMask |= (1 << B_STARTUP);
> +        targetMask |= (1 << B_WAL_RECEIVER);
> +
> +        targetMask |= (1 << B_WAL_SUMMARIZER);
> +        targetMask |= (1 << B_SLOTSYNC_WORKER);
>          /* checkpointer, archiver, stats, and syslogger may continue for now */
>  
> +        SignalSomeChildren(SIGTERM, targetMask);
> +
>          /* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
>          pmState = PM_WAIT_BACKENDS;
>      }

I think this might now omit shutting down at least autovac workers, which
afaict previously were included.

Greetings,

Andres Freund



Re: Refactoring postmaster's code to cleanup after child exit

From
Andres Freund
Date:
Hi,

On 2024-09-10 13:33:36 -0400, Robert Haas wrote:
> On Tue, Sep 10, 2024 at 12:59 PM Andres Freund <andres@anarazel.de> wrote:
> > I still think that we'd be better off to just return an error to the client in
> > postmaster, rather than deal with this dead-end children mess. That was
> > perhaps justified at some point, but now it seems to add way more complexity
> > than it's worth. And it's absurdly expensive to fork to return an error. Way
> > more expensive than just having postmaster send an error and close the socket.
> 
> The tricky case is the one where the client write() -- or SSL_write() -- blocks.

Yea, SSL definitely does make it harder. But it's not exactly rocket science
to do non-blocking SSL connection establishment. After all, we do manage to
do so in libpq...

Greetings,

Andres Freund



Re: Refactoring postmaster's code to cleanup after child exit

From
Thomas Munro
Date:
On Sat, Oct 5, 2024 at 7:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> My test for dead-end backends opens 20 TCP (or unix domain) connections
> to the server, in quick succession. That works fine my system, and it
> passed cirrus CI on other platforms, but on FreeBSD it failed
> repeatedly. The behavior in that scenario is apparently
> platform-dependent: it depends on the accept queue size, but what
> happens when you reach the queue size also seems to depend on the
> platform. On my Linux system, the connect() calls in the client are
> blocked, if the server is doesn't call accept() fast enough, but
> apparently you get an error on *BSD systems.

Right, we've analysed that difference in AF_UNIX implementation
before[1], which shows up in the real world, where client sockets ie
libpq's are usually non-blocking, as EAGAIN on Linux (which is not
valid per POSIX) vs ECONNREFUSED on other OSes.  All fail to connect,
but the error message is different.

For blocking AF_UNIX client sockets like in your test, Linux
effectively has an infinite queue made from two layers.  The listen
queue (a queue of connecting sockets) does respect the requested
backlog size, but when it's full it has an extra trick: the connect()
call waits (in a queue of threads) for space to become free in the
listen queue, so it's effectively unlimited (but only for blocking
sockets), while FreeBSD and I suspect any other implementation
deriving from or reimplementing the BSD socket code gives you
ECONNREFUSED.  macOS behaves just the same as FreeBSD AFAICT, so I
don't know why you didn't see the same thing... I guess it's just
racing against accept() draining the queue.

It's possible that Windows copied the Linux behaviour for AF_UNIX,
given that it probably has something to do with the WSL project for
emulating Linux, but IDK.

[1] https://www.postgresql.org/message-id/flat/CADc_NKg2d%2BoZY9mg4DdQdoUcGzN2kOYXBu-3--RW_hEe0tUV%3Dg%40mail.gmail.com

> I'm not sure of the exact details, but in any case, platform-dependent
> behavior needs to be avoided in tests. So I changed the test so that it
> sends an SSLRequest packet on each connection and waits for reply (which
> is always 'N' to reject it in this test), before opening the next
> connection. This way, each connection is still left hanging, which is
> what I want in this test, but only after postmaster has successfully
> accept()ed it and forked the backend.

Makes sense.



Re: Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
On 05/10/2024 01:03, Thomas Munro wrote:
> On Sat, Oct 5, 2024 at 7:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> My test for dead-end backends opens 20 TCP (or unix domain) connections
>> to the server, in quick succession. That works fine my system, and it
>> passed cirrus CI on other platforms, but on FreeBSD it failed
>> repeatedly. The behavior in that scenario is apparently
>> platform-dependent: it depends on the accept queue size, but what
>> happens when you reach the queue size also seems to depend on the
>> platform. On my Linux system, the connect() calls in the client are
>> blocked, if the server is doesn't call accept() fast enough, but
>> apparently you get an error on *BSD systems.
> 
> Right, we've analysed that difference in AF_UNIX implementation
> before[1], which shows up in the real world, where client sockets ie
> libpq's are usually non-blocking, as EAGAIN on Linux (which is not
> valid per POSIX) vs ECONNREFUSED on other OSes.  All fail to connect,
> but the error message is different.

Thanks for the pointer!

> For blocking AF_UNIX client sockets like in your test, Linux
> effectively has an infinite queue made from two layers.  The listen
> queue (a queue of connecting sockets) does respect the requested
> backlog size, but when it's full it has an extra trick: the connect()
> call waits (in a queue of threads) for space to become free in the
> listen queue, so it's effectively unlimited (but only for blocking
> sockets), while FreeBSD and I suspect any other implementation
> deriving from or reimplementing the BSD socket code gives you
> ECONNREFUSED.  macOS behaves just the same as FreeBSD AFAICT, so I
> don't know why you didn't see the same thing... I guess it's just
> racing against accept() draining the queue.

In fact I misremembered: the failure happened on macOS, *not* FreeBSD. 
It could be just luck I didn't see it on FreeBSD though.

> It's possible that Windows copied the Linux behaviour for AF_UNIX,
> given that it probably has something to do with the WSL project for
> emulating Linux, but IDK.

Sadly Windows' IO::Socket::UNIX hasn't been implemented on Windows (or 
at least on this perl distribution we're using in Cirrus CI):

Socket::pack_sockaddr_un not implemented on this architecture at 
C:/strawberry/5.26.3.1/perl/lib/Socket.pm line 872.

so I'll have to disable this test on Windows anyway.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Refactoring postmaster's code to cleanup after child exit

From
Dagfinn Ilmari Mannsåker
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:

> On 05/10/2024 01:03, Thomas Munro wrote:
>
>> It's possible that Windows copied the Linux behaviour for AF_UNIX,
>> given that it probably has something to do with the WSL project for
>> emulating Linux, but IDK.
>
> Sadly Windows' IO::Socket::UNIX hasn't been implemented on Windows (or
> at least on this perl distribution we're using in Cirrus CI):
>
> Socket::pack_sockaddr_un not implemented on this architecture at
> C:/strawberry/5.26.3.1/perl/lib/Socket.pm line 872.
>
> so I'll have to disable this test on Windows anyway.

Socket version 2.028 (included in Perl 5.32) provides pack_sockaddr_un()
on Windows, so that can be fixed by bumping the Perl version in
https://github.com/anarazel/pg-vm-images/blob/main/packer/windows.pkr.hcl
to something more modern (such as 5.40.0.1), and only skipping the test
if on Windows if Socket is too old.

The decision to use 5.26 seems to come from the initial creation of the
CI images in 2021 (when 5.34 was current), with the comment «newer
versions don't currently work correctly for plperl».  That claim is
worth revisiting, and fixing if it's still the case.


- ilmari




Re: Refactoring postmaster's code to cleanup after child exit

From
Andres Freund
Date:
Hi,

On 2024-10-05 20:51:50 +0100, Dagfinn Ilmari Mannsåker wrote:
> Socket version 2.028 (included in Perl 5.32) provides pack_sockaddr_un()
> on Windows, so that can be fixed by bumping the Perl version in
> https://github.com/anarazel/pg-vm-images/blob/main/packer/windows.pkr.hcl
> to something more modern (such as 5.40.0.1), and only skipping the test
> if on Windows if Socket is too old.
> 
> The decision to use 5.26 seems to come from the initial creation of the
> CI images in 2021 (when 5.34 was current), with the comment «newer
> versions don't currently work correctly for plperl».  That claim is
> worth revisiting, and fixing if it's still the case.

I think we fixed the issues that were known at the time. I think I tried
upgrading to something newer at some point and there were some weird, but
fixable, encoding issues. Unfortunately I don't have the bandwidth to tackle
this rn.

Greetings,

Andres Freund



Re: Refactoring postmaster's code to cleanup after child exit

From
Nazir Bilal Yavuz
Date:
Hi,

On Tue, 8 Oct 2024 at 00:55, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 05/10/2024 22:51, Dagfinn Ilmari Mannsåker wrote:
> > Heikki Linnakangas <hlinnaka@iki.fi> writes:
> >> Sadly Windows' IO::Socket::UNIX hasn't been implemented on Windows (or
> >> at least on this perl distribution we're using in Cirrus CI):
> >>
> >> Socket::pack_sockaddr_un not implemented on this architecture at
> >> C:/strawberry/5.26.3.1/perl/lib/Socket.pm line 872.
> >>
> >> so I'll have to disable this test on Windows anyway.
> >
> > Socket version 2.028 (included in Perl 5.32) provides pack_sockaddr_un()
> > on Windows, so that can be fixed by bumping the Perl version in
> > https://github.com/anarazel/pg-vm-images/blob/main/packer/windows.pkr.hcl
> > to something more modern (such as 5.40.0.1), and only skipping the test
> > if on Windows if Socket is too old.
> >
> > The decision to use 5.26 seems to come from the initial creation of the
> > CI images in 2021 (when 5.34 was current), with the comment «newer
> > versions don't currently work correctly for plperl».  That claim is
> > worth revisiting, and fixing if it's still the case.
>
> Yeah, it would be nice to update it. I wonder if commit
> 341f4e002d461a3c5513cb864490cddae2b43a64 fixed whatever the problem was.

The perl version in Windows CI image is bumped to 5.40.0.1 [1]. So,
the related test passes on Windows now [2].

[1] https://github.com/anarazel/pg-vm-images/commit/cbd5d46f2fb7b28efb126ddac64d12711247dfa8
[2] https://cirrus-ci.com/task/5682393120505856?logs=check_world#L241

--
Regards,
Nazir Bilal Yavuz
Microsoft



Re: Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
On 09/10/2024 23:40, Heikki Linnakangas wrote:
> I pushed the first three patches, with the new test and one of the small
> refactoring patches. Thanks for all the comments so far! Here is a new
> version of the remaining patches.
> 
> Lots of little cleanups and changes here and there since the last
> versions, but the notable bigger changes are:
> 
> - There is now a BackendTypeMask datatype, so that if you try to mix up
> bitmasks and plain BackendType values, the compiler will complain.
> 
> - pmchild.c has been rewritten per feedback, so that the "pools" of
> PMChild structs are more explicit. The size of each pool is only stated
> once, whereas before the same logic was duplicated in
> MaxLivePostmasterChildren() which calculates the number of slots and in
> InitPostmasterChildSlots() which allocates them.
> 
> - In PostmasterStateMachine(), I combined the code to handle
> PM_STOP_BACKENDS and PM_WAIT_BACKENDS. They are essentially the same
> state, except that PM_STOP_BACKENDS first sends the signal to all the
> child processes that it will then wait for. They both needed to build
> the same bitmask of processes to signal or wait for; this eliminates the
> duplication.

Made a few more changes since last patch version:

- Fixed initialization in pmchild.c in single-user and bootstrapping mode
- inlined assign_backendlist_entry() into its only caller; it wasn't 
doing much anymore
- cleaned up some leftovers in canAcceptConnections()
- Renamed some functions for clarity, fixed some leftover comments that 
still talked about Backend structs and BackendList

With those changes, committed. Thanks for the review!

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Refactoring postmaster's code to cleanup after child exit

From
Tomas Vondra
Date:
On 11/14/24 15:13, Heikki Linnakangas wrote:
> On 09/10/2024 23:40, Heikki Linnakangas wrote:
>> I pushed the first three patches, with the new test and one of the small
>> refactoring patches. Thanks for all the comments so far! Here is a new
>> version of the remaining patches.
>>
Hi, the TAP test 001_connection_limits.pl introduced by 6a1d0d470e84
seems to have problems with valgrind :-( I reliably get this failure:


t/001_connection_limits.pl .. 3/? # Tests were run but no plan was
declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 4.
t/001_connection_limits.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
All 4 subtests passed


and tmp_check/log/regress_log_001_connection_limits says:


[23:48:44.444](1.129s) ok 3 - reserved_connections limit
[23:48:44.445](0.001s) ok 4 - reserved_connections limit: matches
process ended prematurely at
/home/user/work/postgres/src/test/postmaster/../../../src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
line 154.
# Postmaster PID for node "primary" is 198592


That BackgroundPsql.pm line is this in wait_connect()

  $self->{run}->pump()
    until $self->{stdout} =~ /$banner/ || $self->{timeout}->is_expired;

By trial and error I found that it fails on this line 70:

  push(@sessions, background_psql_as_user('regress_superuser'));

but I have no idea idea why. There are multiple similar calls a couple
lines earlier, and those work fine. And various other TAP tests with
background_sql() work fine too.

So what's so special about this particular line?


regards

-- 
Tomas Vondra




Re: Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
On 09/12/2024 01:12, Tomas Vondra wrote:
> On 11/14/24 15:13, Heikki Linnakangas wrote:
>> On 09/10/2024 23:40, Heikki Linnakangas wrote:
>>> I pushed the first three patches, with the new test and one of the small
>>> refactoring patches. Thanks for all the comments so far! Here is a new
>>> version of the remaining patches.
>>>
> Hi, the TAP test 001_connection_limits.pl introduced by 6a1d0d470e84
> seems to have problems with valgrind :-( I reliably get this failure:

How exactly do you run the test with valgrind? What platform?

It works for me, with this:

(cd build && ninja && rm -rf tmp_install && meson test --suite setup && 
valgrind --leak-check=no --gen-suppressions=all 
--suppressions=/home/heikki/git-sandbox/postgresql/src/tools/valgrind.supp 
--time-stamp=yes 
--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END 
--log-file=$HOME/pg-valgrind/%p.log --trace-children=yes meson test 
--suite postmaster )

> t/001_connection_limits.pl .. 3/? # Tests were run but no plan was
> declared and done_testing() was not seen.
> # Looks like your test exited with 29 just after 4.
> t/001_connection_limits.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
> All 4 subtests passed
> 
> 
> and tmp_check/log/regress_log_001_connection_limits says:
> 
> 
> [23:48:44.444](1.129s) ok 3 - reserved_connections limit
> [23:48:44.445](0.001s) ok 4 - reserved_connections limit: matches
> process ended prematurely at
> /home/user/work/postgres/src/test/postmaster/../../../src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
> line 154.
> # Postmaster PID for node "primary" is 198592
> 
> 
> That BackgroundPsql.pm line is this in wait_connect()
> 
>    $self->{run}->pump()
>      until $self->{stdout} =~ /$banner/ || $self->{timeout}->is_expired;
> 
> By trial and error I found that it fails on this line 70:
> 
>    push(@sessions, background_psql_as_user('regress_superuser'));
> 
> but I have no idea idea why. There are multiple similar calls a couple
> lines earlier, and those work fine. And various other TAP tests with
> background_sql() work fine too.
> 
> So what's so special about this particular line?

Weird. Valgrind makes everything slow; is it a timeout? Any other clues 
in the logs?

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
On 09/12/2024 14:47, Tomas Vondra wrote:
> On 12/9/24 13:30, Heikki Linnakangas wrote:
>> On 09/12/2024 01:12, Tomas Vondra wrote:
>>> On 11/14/24 15:13, Heikki Linnakangas wrote:
>>>> On 09/10/2024 23:40, Heikki Linnakangas wrote:
>>>>> I pushed the first three patches, with the new test and one of the
>>>>> small
>>>>> refactoring patches. Thanks for all the comments so far! Here is a new
>>>>> version of the remaining patches.
>>>>>
>>> Hi, the TAP test 001_connection_limits.pl introduced by 6a1d0d470e84
>>> seems to have problems with valgrind :-( I reliably get this failure:
>>
>> How exactly do you run the test with valgrind? What platform?
>>
> 
> It failed for me on both amd64 (Fedora 41) and rpi5 32/64-bit (Debian).
> 
>> It works for me, with this:
>>
>> (cd build && ninja && rm -rf tmp_install && meson test --suite setup &&
>> valgrind --leak-check=no --gen-suppressions=all --suppressions=/home/
>> heikki/git-sandbox/postgresql/src/tools/valgrind.supp --time-stamp=yes
>> --error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END --log-file=$HOME/
>> pg-valgrind/%p.log --trace-children=yes meson test --suite postmaster )
>>
> 
> I have a patch that tweaks pg_ctl/pg_regress to execute valgrind, so I
> just do
> 
> ./configure --enable-debug --prefix=/home/user/builds/master
> --enable-depend --enable-cassert --enable-tap-tests CPPFLAGS="-O0 -ggdb3
> -DUSE_VALGRIND"
> 
> and then the usual "make check" or whatever.
> 
> The patch has a hardcoded path to the .supp file, and places the
> valgrind log into /tmp. It has worked for me fine up until that commit,
> and it still seems to be working in every other test directory.

Ok, I was able to reproduce this with that setup.

Unsurprisingly, it's a timing issue. It can be reproduced without 
valgrind by adding this delay:

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 289059435a9..1eb6bad72ca 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -583,6 +583,7 @@ errfinish(const char *filename, int lineno, const 
char *funcname)
           * FATAL termination.  The postmaster may or may not consider this
           * worthy of panic, depending on which subprocess returns it.
           */
+        sleep(1);
          proc_exit(1);
      }

The test opens a connection that is expected to fail with the "remaining 
connection slots are reserved for roles with the SUPERUSER attribute" 
error. Right after that, it opens a new connection as superuser, and 
expects it to succeed. But if the previous backend hasn't exited yet, 
the new connection fails with "too many clients already".

Not sure how to fix this. A small sleep in the test would work, but in 
principle there's no delay that's guaranteed to be enough. A more robust 
solution would be to run a "select count(*) from pg_stat_activity" and 
wait until the number of connections are what's expected. I'll try that 
and see how complicated that gets..

-- 
Heikki Linnakangas
Neon (https://neon.tech)



Re: Refactoring postmaster's code to cleanup after child exit

From
Heikki Linnakangas
Date:
On 09/12/2024 22:55, Heikki Linnakangas wrote:
> Not sure how to fix this. A small sleep in the test would work, but in 
> principle there's no delay that's guaranteed to be enough. A more robust 
> solution would be to run a "select count(*) from pg_stat_activity" and 
> wait until the number of connections are what's expected. I'll try that 
> and see how complicated that gets..

Checking pg_stat_activity doesn't help, because the backend doesn't 
register itself in pg_stat_activity until later. A connection that's 
rejected due to connection limits never shows up in pg_stat_activity.

Some options:

0. Do nothing

1. Add a small sleep to the test

2. Move the pgstat_bestart() call earlier in the startup sequence, so 
that a backend shows up in pg_stat_activity before it acquires a PGPROC 
entry, and stays visible until after it has released its PGPROC entry. 
This would give more visibility to backends that are starting up.

3. Rearrange the FATAL error handling so that the process removes itself 
from PGPROC before sending the error to the client. That would be kind 
of nice anyway. Currently, if sending the rejection error message to the 
client blocks, you are holding up a PGPROC slot until the message is 
sent. The error message packet is short, so it's highly unlikely to 
block, but still.

Option 3 seems kind of nice in principle, but looking at the code, it's 
a bit awkward to implement.  Easiest way to implement it would be to 
modify send_message_to_frontend() to not call pq_flush() on FATAL 
errors, and flush the data in socket_close() instead. Not a lot of code, 
but it's a pretty ugly special case.

Option 2 seems nice too, but seems like a lot of work.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Refactoring postmaster's code to cleanup after child exit

From
Tomas Vondra
Date:

On 12/10/24 11:00, Heikki Linnakangas wrote:
> On 09/12/2024 22:55, Heikki Linnakangas wrote:
>> Not sure how to fix this. A small sleep in the test would work, but in
>> principle there's no delay that's guaranteed to be enough. A more
>> robust solution would be to run a "select count(*) from
>> pg_stat_activity" and wait until the number of connections are what's
>> expected. I'll try that and see how complicated that gets..
> 
> Checking pg_stat_activity doesn't help, because the backend doesn't
> register itself in pg_stat_activity until later. A connection that's
> rejected due to connection limits never shows up in pg_stat_activity.
> 
> Some options:
> 
> 0. Do nothing
> 
> 1. Add a small sleep to the test
> 

I'd just add a short sleep. Yeah, sleeps are not great, but everything
else seems like a lot of effort just to make this one test pass under
valgrind, and I don't think it's worth it.

Can we make the sleep conditional on valgrind, so that regular builds
are not affected? I guess regular builds could fail too, but I don't
think we've seen such failures until now.


regards

-- 
Tomas Vondra