Thread: Refactoring postmaster's code to cleanup after child exit
Reading through postmaster code, I spotted some refactoring opportunities to make it slightly more readable.

Currently, when a child process exits, the postmaster first scans through BackgroundWorkerList to see if it was a bgworker process. If not found, it scans through the BackendList to see if it was a backend process (which it really should be then). That feels a bit silly, because every running background worker process also has an entry in BackendList. There's a lot of duplication between CleanupBackgroundWorker and CleanupBackend.

Before commit 8a02b3d732, we used to create Backend entries only for background worker processes that connected to a database, not for other background worker processes. I think that's why we have the code structure we have. But now that we have a Backend entry for all bgworker processes, it's more natural to have a single function to deal with both regular backends and bgworkers.

So I came up with the attached patches. This doesn't make any meaningful user-visible changes, except for some incidental changes in log messages (see commit message for details).

--
Heikki Linnakangas
Neon (https://neon.tech)
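To make the intended shape concrete, here is a minimal, self-contained C sketch: one scan over a single child list, with a bgworker-specific branch. The types and names are simplified stand-ins, not postmaster.c's actual Backend and RegisteredBgWorker structures.

    #include <stdbool.h>
    #include <stddef.h>

    /* Simplified stand-in for postmaster.c's per-child bookkeeping entry. */
    typedef struct ChildEntry
    {
        int     pid;
        bool    is_bgworker;    /* entry also has a bgworker registration */
        struct ChildEntry *next;
    } ChildEntry;

    static ChildEntry *child_list;  /* all running children, bgworkers included */

    /* One function handles the death of both regular backends and bgworkers. */
    static void
    cleanup_child(int pid, int exitstatus)
    {
        (void) exitstatus;          /* would drive bgworker restart policy */

        for (ChildEntry **link = &child_list; *link != NULL; link = &(*link)->next)
        {
            ChildEntry *bp = *link;

            if (bp->pid != pid)
                continue;

            if (bp->is_bgworker)
            {
                /* bgworker-specific cleanup: update the registration,
                 * decide whether to schedule a restart, notify waiters */
            }

            /* common cleanup shared with regular backends: unlink the entry */
            *link = bp->next;
            return;
        }

        /* not found: would be logged as an "untracked child process" */
    }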
Attachment
On 06/07/2024 22:01, Heikki Linnakangas wrote: > Reading through postmaster code, I spotted some refactoring > opportunities to make it slightly more readable. > > Currently, when a child process exits, the postmaster first scans > through BackgroundWorkerList to see if it was a bgworker process. If not > found, it scans through the BackendList to see if it was a backend > process (which it really should be then). That feels a bit silly, > because every running background worker process also has an entry in > BackendList. There's a lot of duplication between > CleanupBackgroundWorker and CleanupBackend. > > Before commit 8a02b3d732, we used to created Backend entries only for > background worker processes that connected to a database, not for other > background worker processes. I think that's why we have the code > structure we have. But now that we have a Backend entry for all bgworker > processes, it's more natural to have single function to deal with both > regular backends and bgworkers. > > So I came up with the attached patches. This doesn't make any meaningful > user-visible changes, except for some incidental changes in log messages > (see commit message for details). New patch version attached. Fixed conflicts with recent commits, no other changes. -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
I committed the first two trivial patches, and have continued to work on postmaster.c and how it manages all the child processes.

This is a lot of patches. They're built on top of each other, because that's the order I developed them in, but they probably could be applied in a different order too. Please help me by reviewing these, before the stack grows even larger :-). Even partial reviews would be very helpful. I suggest reading them in order, and when you get tired, just send any comments you have up to that point.

* v3-0001-Make-BackgroundWorkerList-doubly-linked.patch

This is the same refactoring patch I started this thread with.

* v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch
* v3-0004-Consolidate-postmaster-code-to-launch-background-.patch

Little refactoring of how postmaster launches the background processes.

* v3-0005-Add-test-for-connection-limits.patch
* v3-0006-Add-test-for-dead-end-backends.patch

A few new TAP tests for dead-end backends and enforcing connection limits. We didn't have much coverage for these before.

* v3-0007-Use-an-shmem_exit-callback-to-remove-backend-from.patch
* v3-0008-Introduce-a-separate-BackendType-for-dead-end-chi.patch

Some preliminary refactoring towards patch v3-0010-Assign-a-child-slot-to-every-postmaster-child-pro.patch.

* v3-0009-Kill-dead-end-children-when-there-s-nothing-else-.patch

I noticed that we never send SIGTERM or SIGQUIT to dead-end backends, which seems silly. If the server is shutting down, dead-end backends might prevent the shutdown from completing. Dead-end backends will expire after authentication_timeout (default 60s), so they won't linger for too long, but still, it seems like we should kill dead-end backends if they're the only children preventing shutdown from completing.

* v3-0010-Assign-a-child-slot-to-every-postmaster-child-pro.patch

This is what I consider the main patch in this series. Currently, only regular backends, bgworkers and autovacuum workers have a PMChildFlags slot, which is used to detect when a postmaster child exits in an unclean way (in addition to the exit code). This patch assigns a child slot for all processes, except for dead-end backends. That includes all the aux processes.

While we're at it, I created separate pools of child slots for different kinds of backends, which fixes the issue that opening a lot of client connections can exhaust all the slots, so that background workers or autovacuum workers cannot start either [1].

[1] https://www.postgresql.org/message-id/55d2f50c-0b81-4b33-b202-cd2a406d69a3%40iki.fi

* v3-0011-Pass-MyPMChildSlot-as-an-explicit-argument-to-chi.patch

One more little refactoring, to pass MyPMChildSlot to the child process differently.

Where is all this leading? I'm not sure exactly, but having a postmaster child slot for every postmaster child seems highly useful. We could move the ProcSignal machinery to use those slot numbers as the indexes into the ProcSignal array, instead of the ProcNumber-based indexes it uses now, for example. That would allow all processes to participate in the signalling, even before they have a PGPROC entry. (Or, with Thomas's interrupts refactoring, the interrupts array.) With the multithreading work, the PMChild struct could store a thread id, or whatever is needed for threads to communicate with each other. In any case, it seems like it will come in handy.

--
Heikki Linnakangas
Neon (https://neon.tech)
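As a rough illustration of the separate-pools idea in v3-0010 (a toy sketch, not the patch's code; the pool names and the flag-array representation are invented for the example), each kind of child draws from its own fixed-capacity pool, so exhausting one pool cannot starve the others:

    #include <stdbool.h>

    #define MAX_SLOTS_PER_POOL 64       /* arbitrary for the illustration */

    typedef enum SlotPool
    {
        POOL_BACKEND,                   /* regular backends, walsenders, ... */
        POOL_AUTOVAC,                   /* autovacuum workers */
        POOL_BGWORKER,                  /* background workers */
        POOL_AUX,                       /* auxiliary processes */
        NUM_POOLS
    } SlotPool;

    static bool slot_in_use[NUM_POOLS][MAX_SLOTS_PER_POOL];
    static int  pool_size[NUM_POOLS];   /* derived from GUCs at startup */

    /* Assign a slot from the requested pool only; other pools are unaffected. */
    static int
    assign_child_slot(SlotPool pool)
    {
        for (int i = 0; i < pool_size[pool]; i++)
        {
            if (!slot_in_use[pool][i])
            {
                slot_in_use[pool][i] = true;
                return i;               /* slot number within this pool */
            }
        }
        return -1;                      /* pool exhausted, e.g. "too many clients" */
    }

    static void
    release_child_slot(SlotPool pool, int slot)
    {
        slot_in_use[pool][slot] = false;
    }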
Attachment
- v3-0001-Make-BackgroundWorkerList-doubly-linked.patch
- v3-0002-Refactor-code-to-handle-death-of-a-backend-or-bgw.patch
- v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch
- v3-0004-Consolidate-postmaster-code-to-launch-background-.patch
- v3-0005-Add-test-for-connection-limits.patch
- v3-0006-Add-test-for-dead-end-backends.patch
- v3-0007-Use-an-shmem_exit-callback-to-remove-backend-from.patch
- v3-0008-Introduce-a-separate-BackendType-for-dead-end-chi.patch
- v3-0009-Kill-dead-end-children-when-there-s-nothing-else-.patch
- v3-0010-Assign-a-child-slot-to-every-postmaster-child-pro.patch
- v3-0011-Pass-MyPMChildSlot-as-an-explicit-argument-to-chi.patch
On Fri, Aug 2, 2024 at 11:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> * v3-0001-Make-BackgroundWorkerList-doubly-linked.patch

LGTM.

> [v3-0002-Refactor-code-to-handle-death-of-a-backend-or-bgw.patch]

Currently, when a child process exits, the postmaster first scans through BackgroundWorkerList, to see if the child process was a background worker. If not found, then it scans through BackendList to see if it was a regular backend. That leads to some duplication between the bgworker and regular backend cleanup code, as both have an entry in the BackendList that needs to be cleaned up in the same way.

Refactor that so that we scan just the BackendList to find the child process, and if it was a background worker, do the additional bgworker-specific cleanup in addition to the normal Backend cleanup.

Makes sense.

On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's now logged with that exit code, instead of 0. Also, if a bgworker exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is restarted. Previously it was treated as a normal exit.

Interesting. So when that error was first specially handled in this thread:

https://www.postgresql.org/message-id/flat/AANLkTimCTkNKKrHCd3Ot6kAsrSS7SeDpOTcaLsEP7i%2BM%40mail.gmail.com#41f60947571b75377f04af67ba6baf40

... it went from being considered a crash, to being considered like exit(0). It's true that shared memory can't be corrupted by a process that never enters main(), but it's better not to hide the true reason for the failure (if it is still possible -- I don't find many references to that phenomenon in recent times). Clobbering exitstatus with 0 doesn't seem right at all, now that we have background workers whose restart behaviour is affected by that.

If a child process is not found in the BackendList, the log message now calls it "untracked child process" rather than "server process". Arguably that should be a PANIC, because we do track all the child processes in the list, so failing to find a child process is highly unexpected. But if we want to change that, let's discuss and do that as a separate commit.

Yeah, it would be highly unexpected if waitpid() told you about some random other process (or we screwed up the bookkeeping and didn't recognise it). So at least having a different message seems good.

> * v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch

+1

> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch

Much of the code in process_pm_child_exit() to launch replacement processes when one exits or when progressing to the next postmaster state was unnecessary, because the ServerLoop will launch any missing background processes anyway. Remove the redundant code and let ServerLoop handle it.

+1, makes sense.

In ServerLoop, move the code to launch all the processes to a new subroutine, to group it all together.

+1, makes sense.

More soon...
On 08/08/2024 13:47, Thomas Munro wrote:
> On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's now logged with that exit code, instead of 0. Also, if a bgworker exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is restarted. Previously it was treated as a normal exit.
>
> Interesting. So when that error was first specially handled in this thread:
>
> https://www.postgresql.org/message-id/flat/AANLkTimCTkNKKrHCd3Ot6kAsrSS7SeDpOTcaLsEP7i%2BM%40mail.gmail.com#41f60947571b75377f04af67ba6baf40
>
> ... it went from being considered a crash, to being considered like exit(0). It's true that shared memory can't be corrupted by a process that never enters main(), but it's better not to hide the true reason for the failure (if it is still possible -- I don't find many references to that phenomenon in recent times). Clobbering exitstatus with 0 doesn't seem right at all, now that we have background workers whose restart behaviour is affected by that.

I adjusted this ERROR_WAIT_NO_CHILDREN a little more, to avoid logging the death of the child twice in some cases.

>> * v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch
>
> +1

Committed the patches up to and including this one, with tiny comment changes.

>> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch
>
> Much of the code in process_pm_child_exit() to launch replacement processes when one exits or when progressing to next postmaster state was unnecessary, because the ServerLoop will launch any missing background processes anyway. Remove the redundant code and let ServerLoop handle it.

I'm going to work a little more on the comments on this one before committing; I had just moved all the "If we have lost the XXX, try to start a new one" comments as is, but they look pretty repetitive now.

Thanks for the review!

--
Heikki Linnakangas
Neon (https://neon.tech)
On 10/08/2024 00:13, Heikki Linnakangas wrote:
> On 08/08/2024 13:47, Thomas Munro wrote:
>>> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch
>>
>> Much of the code in process_pm_child_exit() to launch replacement processes when one exits or when progressing to next postmaster state was unnecessary, because the ServerLoop will launch any missing background processes anyway. Remove the redundant code and let ServerLoop handle it.
>
> I'm going to work a little more on the comments on this one before committing; I had just moved all the "If we have lost the XXX, try to start a new one" comments as is, but they look pretty repetitive now.

Pushed this now, after adjusting the comments a bit. Thanks again for the review!

Here are the remaining patches, rebased.

> commit a1c43d65907d20a999b203e465db1277ec842a0a
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Thu Aug 1 17:24:12 2024 +0300
>
> Introduce a separate BackendType for dead-end children
>
> And replace postmaster.c's own "backend type" codes with BackendType
>
> XXX: While working on this, many times I accidentally did something like "foo |= B_SOMETHING" instead of "foo |= 1 << B_SOMETHING", when constructing arguments to SignalSomeChildren or CountChildren, and things broke in very subtle ways taking a long time to debug. The old constants that were already bitmasks avoided that. Maybe we need some macro magic or something to make this less error-prone.

While rebasing this today, I spotted another instance of that mistake mentioned in the XXX comment above. I called "CountChildren(B_BACKEND)" instead of "CountChildren(1 << B_BACKEND)". Some ideas on how to make that less error-prone:

1. Add a separate typedef for the bitmasks, and macros/functions to work with it. Something like:

    typedef struct
    {
        uint32 mask;
    } BackendTypeMask;

    static const BackendTypeMask BTMASK_ALL = { 0xffffffff };
    static const BackendTypeMask BTMASK_NONE = { 0 };

    static inline BackendTypeMask
    BTMASK_ADD(BackendTypeMask mask, BackendType t)
    {
        mask.mask |= 1 << t;
        return mask;
    }

    static inline BackendTypeMask
    BTMASK_DEL(BackendTypeMask mask, BackendType t)
    {
        mask.mask &= ~(1 << t);
        return mask;
    }

Now the compiler will complain if you try to pass a BackendType for the mask. We could do this just for BackendType, or we could put this in src/include/lib/ with a more generic name, like "bitmask_u32".

2. Another idea is to redefine the BackendType values to be separate bits, like the current BACKEND_TYPE_* values in postmaster.c:

    typedef enum BackendType
    {
        B_INVALID = 0,

        /* Backends and other backend-like processes */
        B_BACKEND = 1 << 1,
        B_DEAD_END_BACKEND = 1 << 2,
        B_AUTOVAC_LAUNCHER = 1 << 3,
        B_AUTOVAC_WORKER = 1 << 4,
        ...
    } BackendType;

Then you can use | and & on BackendTypes directly. It makes it less clear which function arguments are a BackendType and which are a bitmask, however.

Thoughts, other ideas?

--
Heikki Linnakangas
Neon (https://neon.tech)
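For comparison, a short sketch of what call sites would look like under option 1, assuming CountChildren() and SignalSomeChildren() were changed to take the new BackendTypeMask type:

    BackendTypeMask mask = BTMASK_NONE;
    int         n;

    mask = BTMASK_ADD(mask, B_BACKEND);
    mask = BTMASK_ADD(mask, B_WAL_SENDER);

    SignalSomeChildren(SIGHUP, mask);
    n = CountChildren(mask);

    /*
     * The mistake that motivated this no longer compiles, because a bare
     * BackendType is not implicitly convertible to BackendTypeMask:
     *
     *     n = CountChildren(B_BACKEND);    -- compile-time error
     */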
Attachment
- v4-0001-Add-test-for-connection-limits.patch
- v4-0002-Add-test-for-dead-end-backends.patch
- v4-0003-Use-an-shmem_exit-callback-to-remove-backend-from.patch
- v4-0004-Introduce-a-separate-BackendType-for-dead-end-chi.patch
- v4-0005-Kill-dead-end-children-when-there-s-nothing-else-.patch
- v4-0006-Assign-a-child-slot-to-every-postmaster-child-pro.patch
- v4-0007-Pass-MyPMChildSlot-as-an-explicit-argument-to-chi.patch
Hello Heikki,

10.08.2024 00:13, Heikki Linnakangas wrote:
>
> Committed the patches up to and including this one, with tiny comment changes.

I've noticed that on the current HEAD, server.log contains binary data (an invalid process name) after a child crash. For example, while playing with -ftrapv, I've got:

SELECT to_date('2024 613566758 1', 'IYYY IW ID');
server closed the connection unexpectedly

grep -a 'was terminated' server.log
2024-08-18 07:07:06.482 UTC|||66c19d96.3482f6|LOG: `�!x� (PID 3441407) was terminated by signal 6: Aborted

It looks like this was introduced by commit 28a520c0b (IIUC, namebuf in CleanupBackend() may stay uninitialized in some code paths).

Best regards,
Alexander
On 18/08/2024 11:00, Alexander Lakhin wrote: > 10.08.2024 00:13, Heikki Linnakangas wrote: >> >> Committed the patches up to and including this one, with tiny comment changes. > > I've noticed that on the current HEAD server.log contains binary data > (an invalid process name) after a child crash. For example, while playing > with -ftapv, I've got: > SELECT to_date('2024 613566758 1', 'IYYY IW ID'); > server closed the connection unexpectedly > > grep -a 'was terminated' server.log > 2024-08-18 07:07:06.482 UTC|||66c19d96.3482f6|LOG: `�!x� (PID 3441407) was terminated by signal 6: Aborted > > It looks like this was introduced by commit 28a520c0b (IIUC, namebuf in > CleanupBackend() may stay uninitialized in some code paths). Fixed, thanks! -- Heikki Linnakangas Neon (https://neon.tech)
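The committed fix itself isn't quoted in this thread; purely as an illustration, one defensive shape for this kind of bug is to give namebuf a fallback value before any of the per-type branches run (the bgworker branch below reuses the snprintf() quoted later in the thread):

    char        namebuf[MAXPGPATH];

    /*
     * Start with a generic fallback so that a code path which forgets to
     * fill in the name can no longer leak uninitialized stack bytes into
     * the log message.
     */
    snprintf(namebuf, MAXPGPATH, _("server process"));

    if (bp->bkend_type == B_BG_WORKER)
        snprintf(namebuf, MAXPGPATH, _("background worker \"%s\""),
                 bp->rw->rw_worker.bgw_type);
    /* ... other branches overwrite the default as before ... */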
Hi, On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote: > While rebasing this today, I spotted another instance of that mistake > mentioned in the XXX comment above. I called "CountChildren(B_BACKEND)" > instead of "CountChildren(1 << B_BACKEND)". Some ideas on how to make that > less error-prone: > > 1. Add a separate typedef for the bitmasks, and macros/functions to work > with it. Something like: > > typedef struct { > uint32 mask; > } BackendTypeMask; > > static const BackendTypeMask BTMASK_ALL = { 0xffffffff }; > static const BackendTypeMask BTMASK_NONE = { 0 }; > > static inline BackendTypeMask > BTMASK_ADD(BackendTypeMask mask, BackendType t) > { > mask.mask |= 1 << t; > return mask; > } > > static inline BackendTypeMask > BTMASK_DEL(BackendTypeMask mask, BackendType t) > { > mask.mask &= ~(1 << t); > return mask; > } > > Now the compiler will complain if you try to pass a BackendType for the > mask. We could do this just for BackendType, or we could put this in > src/include/lib/ with a more generic name, like "bitmask_u32". I don't like the second suggestion - that just ends up creating a similar problem in the future because flag values for one thing can be passed to something else. > +Running the tests > +================= > + > +NOTE: You must have given the --enable-tap-tests argument to configure. > + > +Run > + make check > +or > + make installcheck > +You can use "make installcheck" if you previously did "make install". > +In that case, the code in the installation tree is tested. With > +"make check", a temporary installation tree is built from the current > +sources and then tested. > + > +Either way, this test initializes, starts, and stops a test Postgres > +cluster. > + > +See src/test/perl/README for more info about running these tests. Is it really useful to have such instructions all over the tree? > From 93b9e9b6e072f63af9009e0d66ab6d0d62ea8c15 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Mon, 12 Aug 2024 10:55:11 +0300 > Subject: [PATCH v4 2/8] Add test for dead-end backends > > The code path for launching a dead-end backend because we're out of > slots was not covered by any tests, so add one. (Some tests did hit > the case of launching a dead-end backend because the server is still > starting up, though, so the gap in our test coverage wasn't as big as > it sounds.) > --- > src/test/perl/PostgreSQL/Test/Cluster.pm | 39 +++++++++++++++++++ > .../postmaster/t/001_connection_limits.pl | 17 +++++++- > 2 files changed, 55 insertions(+), 1 deletion(-) Why does this need to use "raw" connections? Can't you just create a bunch of connections with BackgroundPsql? > From 88287a2db95e584018f1c7fa9e992feb7d179ce8 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Mon, 12 Aug 2024 10:58:35 +0300 > Subject: [PATCH v4 3/8] Use an shmem_exit callback to remove backend from > PMChildFlags on exit > > This seems nicer than having to duplicate the logic between > InitProcess() and ProcKill() for which child processes have a > PMChildFlags slot. > > Move the MarkPostmasterChildActive() call earlier in InitProcess(), > out of the section protected by the spinlock. 
> --- > src/backend/storage/ipc/pmsignal.c | 10 ++++++-- > src/backend/storage/lmgr/proc.c | 38 ++++++++++-------------------- > src/include/storage/pmsignal.h | 1 - > 3 files changed, 21 insertions(+), 28 deletions(-) > > diff --git a/src/backend/storage/ipc/pmsignal.c b/src/backend/storage/ipc/pmsignal.c > index 27844b46a2..cb99e77476 100644 > --- a/src/backend/storage/ipc/pmsignal.c > +++ b/src/backend/storage/ipc/pmsignal.c > @@ -24,6 +24,7 @@ > #include "miscadmin.h" > #include "postmaster/postmaster.h" > #include "replication/walsender.h" > +#include "storage/ipc.h" > #include "storage/pmsignal.h" > #include "storage/shmem.h" > #include "utils/memutils.h" > @@ -121,6 +122,8 @@ postmaster_death_handler(SIGNAL_ARGS) > > #endif /* USE_POSTMASTER_DEATH_SIGNAL */ > > +static void MarkPostmasterChildInactive(int code, Datum arg); > + > /* > * PMSignalShmemSize > * Compute space needed for pmsignal.c's shared memory > @@ -328,6 +331,9 @@ MarkPostmasterChildActive(void) > slot--; > Assert(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED); > PMSignalState->PMChildFlags[slot] = PM_CHILD_ACTIVE; > + > + /* Arrange to clean up at exit. */ > + on_shmem_exit(MarkPostmasterChildInactive, 0); > } > > /* > @@ -352,8 +358,8 @@ MarkPostmasterChildWalSender(void) > * MarkPostmasterChildInactive - mark a postmaster child as done using > * shared memory. This is called in the child process. > */ > -void > -MarkPostmasterChildInactive(void) > +static void > +MarkPostmasterChildInactive(int code, Datum arg) > { > int slot = MyPMChildSlot; > > diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c > index ac66da8638..9536469e89 100644 > --- a/src/backend/storage/lmgr/proc.c > +++ b/src/backend/storage/lmgr/proc.c > @@ -308,6 +308,19 @@ InitProcess(void) > if (MyProc != NULL) > elog(ERROR, "you already exist"); > > + /* > + * Before we start accessing the shared memory in a serious way, mark > + * ourselves as an active postmaster child; this is so that the postmaster > + * can detect it if we exit without cleaning up. (XXX autovac launcher > + * currently doesn't participate in this; it probably should.) > + * > + * Slot sync worker also does not participate in it, see comments atop > + * 'struct bkend' in postmaster.c. > + */ > + if (IsUnderPostmaster && !AmAutoVacuumLauncherProcess() && > + !AmLogicalSlotSyncWorkerProcess()) > + MarkPostmasterChildActive(); I'd not necessarily expect a call to MarkPostmasterChildActive() to register an shmem exit hook - but I guess it's unlikely to be moved around in a problematic way. Perhaps something like RegisterPostmasterChild() or such would be a bit clearer? > From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Mon, 12 Aug 2024 10:59:04 +0300 > Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children > > And replace postmaster.c's own "backend type" codes with BackendType Hm - it seems a bit odd to open-code this when we actually have a "table driven configuration" available? Why isn't the type a field in child_process_kind? That'd not solve the bitmask confusion issue, but it does seem like a better direction to me? 
> From 9c832ce33667abc5aef128a17fa9c27daaad872a Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Mon, 12 Aug 2024 10:59:27 +0300 > Subject: [PATCH v4 5/8] Kill dead-end children when there's nothing else left > > Previously, the postmaster would never try to kill dead-end child > processes, even if there were no other processes left. A dead-end > backend will eventually exit, when authentication_timeout expires, but > if a dead-end backend is the only thing that's preventing the server > from shutting down, it seems better to kill it immediately. It's > particularly important, if there was a bug in the early startup code > that prevented a dead-end child from timing out and exiting normally. I do wonder if we shouldn't instead get rid of dead end children. We now have an event based loop in postmaster, it'd perform vastly better to juts handle these connections in postmaster. And we'd get rid of these weird backend types. But I guess this is a worthwhile improvement on its own... > Includes a test for that case where a dead-end backend previously kept > the server from shutting down. The test hardcodes timeouts, I think we've largely come to regret that when we did. Should probably just be a multiplier based on PostgreSQL::Test::Utils::timeout_default? > +/* > + * MaxLivePostmasterChildren > + * > + * This reports the number postmaster child processes that can be active. It > + * includes all children except for dead_end children. This allows the array > + * in shared memory (PMChildFlags) to have a fixed maximum size. > + */ > +int > +MaxLivePostmasterChildren(void) > +{ > + int n = 0; > + > + /* We know exactly how mamy worker and aux processes can be active */ > + n += autovacuum_max_workers; > + n += max_worker_processes; > + n += NUM_AUXILIARY_PROCS; > + > + /* > + * We allow more connections here than we can have backends because some > + * might still be authenticating; they might fail auth, or some existing > + * backend might exit before the auth cycle is completed. The exact > + * MaxBackends limit is enforced when a new backend tries to join the > + * shared-inval backend array. > + */ > + n += 2 * (MaxConnections + max_wal_senders); > + > + return n; > +} I wonder if we could instead maintain at least some of this in child_process_kinds? Manually listing different types of processes in different places doesn't seem particularly sustainable. 
> +/* > + * Initialize at postmaster startup > + */ > +void > +InitPostmasterChildSlots(void) > +{ > + int num_pmchild_slots; > + int slotno; > + PMChild *slots; > + > + dlist_init(&freeBackendList); > + dlist_init(&freeAutoVacWorkerList); > + dlist_init(&freeBgWorkerList); > + dlist_init(&freeAuxList); > + dlist_init(&ActiveChildList); > + > + num_pmchild_slots = MaxLivePostmasterChildren(); > + > + slots = palloc(num_pmchild_slots * sizeof(PMChild)); > + > + slotno = 0; > + for (int i = 0; i < 2 * (MaxConnections + max_wal_senders); i++) > + { > + init_slot(&slots[slotno], slotno, &freeBackendList); > + slotno++; > + } > + for (int i = 0; i < autovacuum_max_workers; i++) > + { > + init_slot(&slots[slotno], slotno, &freeAutoVacWorkerList); > + slotno++; > + } > + for (int i = 0; i < max_worker_processes; i++) > + { > + init_slot(&slots[slotno], slotno, &freeBgWorkerList); > + slotno++; > + } > + for (int i = 0; i < NUM_AUXILIARY_PROCS; i++) > + { > + init_slot(&slots[slotno], slotno, &freeAuxList); > + slotno++; > + } > + Assert(slotno == num_pmchild_slots); > +} Along the same vein - could we generalize this into one array of different slot types and then loop over that to initialize / acquire the slots? > +/* Return the appropriate free-list for the given backend type */ > +static dlist_head * > +GetFreeList(BackendType btype) > +{ > + switch (btype) > + { > + case B_BACKEND: > + case B_BG_WORKER: > + case B_WAL_SENDER: > + case B_SLOTSYNC_WORKER: > + return &freeBackendList; Maybe a daft question - but why are all of these in the same list? Sure, they're essentially all full backends, but they're covered by different GUCs? > + /* > + * Auxiliary processes. There can be only one of each of these > + * running at a time. > + */ > + case B_AUTOVAC_LAUNCHER: > + case B_ARCHIVER: > + case B_BG_WRITER: > + case B_CHECKPOINTER: > + case B_STARTUP: > + case B_WAL_RECEIVER: > + case B_WAL_SUMMARIZER: > + case B_WAL_WRITER: > + return &freeAuxList; > + > + /* > + * Logger is not connected to shared memory, and does not have a > + * PGPROC entry, but we still allocate a child slot for it. > + */ Tangential: Why do we need a freelist for these and why do we choose a random pgproc for these instead of assigning one statically? Background: I'd like to not provide AIO workers with "bounce buffers" (for IO of buffers that can't be done in-place, like writes when checksums are enabled). The varying proc numbers make that harder than it'd have to be... > +PMChild * > +AssignPostmasterChildSlot(BackendType btype) > +{ > + dlist_head *freelist; > + PMChild *pmchild; > + > + freelist = GetFreeList(btype); > + > + if (dlist_is_empty(freelist)) > + return NULL; > + > + pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist)); > + pmchild->pid = 0; > + pmchild->bkend_type = btype; > + pmchild->rw = NULL; > + pmchild->bgworker_notify = true; > + > + /* > + * pmchild->child_slot for each entry was initialized when the array of > + * slots was allocated. > + */ > + > + dlist_push_head(&ActiveChildList, &pmchild->elem); > + > + ReservePostmasterChildSlot(pmchild->child_slot); > + > + /* FIXME: find a more elegant way to pass this */ > + MyPMChildSlot = pmchild->child_slot; What if we assigned one offset for each process and assigned its ID here and also used that for its ProcNumber - that way we wouldn't need to manage freelists in two places. 
> +PMChild * > +FindPostmasterChildByPid(int pid) > +{ > + dlist_iter iter; > + > + dlist_foreach(iter, &ActiveChildList) > + { > + PMChild *bp = dlist_container(PMChild, elem, iter.cur); > + > + if (bp->pid == pid) > + return bp; > + } > + return NULL; > +} It's not new, but it's quite sad that postmaster's process exit handling is effectively O(Backends^2)... > @@ -1019,7 +980,15 @@ PostmasterMain(int argc, char *argv[]) > /* > * If enabled, start up syslogger collection subprocess > */ > - SysLoggerPID = SysLogger_Start(); > + SysLoggerPMChild = AssignPostmasterChildSlot(B_LOGGER); > + if (!SysLoggerPMChild) > + elog(ERROR, "no postmaster child slot available for syslogger"); > + SysLoggerPMChild->pid = SysLogger_Start(); > + if (SysLoggerPMChild->pid == 0) > + { > + FreePostmasterChildSlot(SysLoggerPMChild); > + SysLoggerPMChild = NULL; > + } Maybe it's a bit obsessive, but this seems long enough to make it worth not doing inline in the already long PostmasterMain(). > /* > * We're ready to rock and roll... > */ > - StartupPID = StartChildProcess(B_STARTUP); > - Assert(StartupPID != 0); > + StartupPMChild = StartChildProcess(B_STARTUP); > + Assert(StartupPMChild != NULL); This (not new) assertion is ... odd. > @@ -1779,21 +1748,6 @@ canAcceptConnections(int backend_type) > if (!connsAllowed && backend_type == B_BACKEND) > return CAC_SHUTDOWN; /* shutdown is pending */ > > - /* > - * Don't start too many children. > - * > - * We allow more connections here than we can have backends because some > - * might still be authenticating; they might fail auth, or some existing > - * backend might exit before the auth cycle is completed. The exact > - * MaxBackends limit is enforced when a new backend tries to join the > - * shared-inval backend array. > - * > - * The limit here must match the sizes of the per-child-process arrays; > - * see comments for MaxLivePostmasterChildren(). > - */ > - if (CountChildren(BACKEND_TYPE_ALL & ~(1 << B_DEAD_END_BACKEND)) >= MaxLivePostmasterChildren()) > - result = CAC_TOOMANY; > - > return result; > } It's nice to get rid of this source of O(N^2). > @@ -1961,26 +1915,6 @@ process_pm_reload_request(void) > (errmsg("received SIGHUP, reloading configuration files"))); > ProcessConfigFile(PGC_SIGHUP); > SignalSomeChildren(SIGHUP, BACKEND_TYPE_ALL & ~(1 << B_DEAD_END_BACKEND)); > - if (StartupPID != 0) > - signal_child(StartupPID, SIGHUP); > - if (BgWriterPID != 0) > - signal_child(BgWriterPID, SIGHUP); > - if (CheckpointerPID != 0) > - signal_child(CheckpointerPID, SIGHUP); > - if (WalWriterPID != 0) > - signal_child(WalWriterPID, SIGHUP); > - if (WalReceiverPID != 0) > - signal_child(WalReceiverPID, SIGHUP); > - if (WalSummarizerPID != 0) > - signal_child(WalSummarizerPID, SIGHUP); > - if (AutoVacPID != 0) > - signal_child(AutoVacPID, SIGHUP); > - if (PgArchPID != 0) > - signal_child(PgArchPID, SIGHUP); > - if (SysLoggerPID != 0) > - signal_child(SysLoggerPID, SIGHUP); > - if (SlotSyncWorkerPID != 0) > - signal_child(SlotSyncWorkerPID, SIGHUP); > > /* Reload authentication config files too */ > if (!load_hba()) For a moment I wondered why this change was part of this commit - but I guess we didn't have any of these in an array/list before this change... > @@ -2469,11 +2410,15 @@ process_pm_child_exit(void) > } > > /* Was it the system logger? 
If so, try to start a new one */ > - if (pid == SysLoggerPID) > + if (SysLoggerPMChild && pid == SysLoggerPMChild->pid) > { > - SysLoggerPID = 0; > /* for safety's sake, launch new logger *first* */ > - SysLoggerPID = SysLogger_Start(); > + SysLoggerPMChild->pid = SysLogger_Start(); > + if (SysLoggerPMChild->pid == 0) > + { > + FreePostmasterChildSlot(SysLoggerPMChild); > + SysLoggerPMChild = NULL; > + } > if (!EXIT_STATUS_0(exitstatus)) > LogChildExit(LOG, _("system logger process"), Seems a bit weird to have one place with a different memory lifetime handling than other places. Why don't we just do this the same way as in other places but continue to defer the logging until after we tried to start the new logger? Might be worth having a test ensuring that loggers restart OK. > /* Construct a process name for log message */ > + > + /* > + * FIXME: use GetBackendTypeDesc here? How does the localization of that > + * work? > + */ > if (bp->bkend_type == B_DEAD_END_BACKEND) > { > procname = _("dead end backend"); Might be worth having a version of GetBackendTypeDesc() that returns a translated string? > @@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char *procname) > { > dlist_iter iter; > > - dlist_foreach(iter, &BackendList) > + dlist_foreach(iter, &ActiveChildList) > { > - Backend *bp = dlist_container(Backend, elem, iter.cur); > + PMChild *bp = dlist_container(PMChild, elem, iter.cur); > + > + /* We do NOT restart the syslogger */ > + if (bp == SysLoggerPMChild) > + continue; That comment seems a bit misleading - we do restart syslogger, we just don't do it here, no? I realize it's an old comment, but it still seems like it's worth fixing given that you touch all the code here... > @@ -2708,48 +2661,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname) > * We could exclude dead_end children here, but at least when > * sending SIGABRT it seems better to include them. > */ > - sigquit_child(bp->pid); > + sigquit_child(bp); > } > - > - if (StartupPID != 0) > - { > - sigquit_child(StartupPID); > - StartupStatus = STARTUP_SIGNALED; > - } > - > - /* Take care of the bgwriter too */ > - if (BgWriterPID != 0) > - sigquit_child(BgWriterPID); > - > - /* Take care of the checkpointer too */ > - if (CheckpointerPID != 0) > - sigquit_child(CheckpointerPID); > - > - /* Take care of the walwriter too */ > - if (WalWriterPID != 0) > - sigquit_child(WalWriterPID); > - > - /* Take care of the walreceiver too */ > - if (WalReceiverPID != 0) > - sigquit_child(WalReceiverPID); > - > - /* Take care of the walsummarizer too */ > - if (WalSummarizerPID != 0) > - sigquit_child(WalSummarizerPID); > - > - /* Take care of the autovacuum launcher too */ > - if (AutoVacPID != 0) > - sigquit_child(AutoVacPID); > - > - /* Take care of the archiver too */ > - if (PgArchPID != 0) > - sigquit_child(PgArchPID); > - > - /* Take care of the slot sync worker too */ > - if (SlotSyncWorkerPID != 0) > - sigquit_child(SlotSyncWorkerPID); > - > - /* We do NOT restart the syslogger */ > } Yay. 
> @@ -2871,29 +2786,27 @@ PostmasterStateMachine(void) > <snip> > - if (WalSummarizerPID != 0) > - signal_child(WalSummarizerPID, SIGTERM); > - if (SlotSyncWorkerPID != 0) > - signal_child(SlotSyncWorkerPID, SIGTERM); > + targetMask |= (1 << B_STARTUP); > + targetMask |= (1 << B_WAL_RECEIVER); > + > + targetMask |= (1 << B_WAL_SUMMARIZER); > + targetMask |= (1 << B_SLOTSYNC_WORKER); > /* checkpointer, archiver, stats, and syslogger may continue for now */ > > + SignalSomeChildren(SIGTERM, targetMask); > + > /* Now transition to PM_WAIT_BACKENDS state to wait for them to die */ > pmState = PM_WAIT_BACKENDS; > <snip> It's likely the right thing to not do as one patch, but IMO this really wants to be a state table. Perhaps as part of child_process_kinds, perhaps separate from that. > @@ -3130,8 +3047,21 @@ static void > LaunchMissingBackgroundProcesses(void) > { > /* Syslogger is active in all states */ > - if (SysLoggerPID == 0 && Logging_collector) > - SysLoggerPID = SysLogger_Start(); > + if (SysLoggerPMChild == NULL && Logging_collector) > + { > + SysLoggerPMChild = AssignPostmasterChildSlot(B_LOGGER); > + if (!SysLoggerPMChild) > + elog(LOG, "no postmaster child slot available for syslogger"); How could this elog() be reached? Seems something seriously would have gone wrong to get here - in which case a LOG that might not even be visible (due to logger not working) doesn't seem like the right response. > @@ -3334,29 +3270,12 @@ SignalSomeChildren(int signal, uint32 targetMask) > static void > TerminateChildren(int signal) > { The comment for TerminateChildren() says "except syslogger and dead_end backends." - aren't you including the latter here? > @@ -311,14 +311,9 @@ InitProcess(void) > /* > * Before we start accessing the shared memory in a serious way, mark > * ourselves as an active postmaster child; this is so that the postmaster > - * can detect it if we exit without cleaning up. (XXX autovac launcher > - * currently doesn't participate in this; it probably should.) > - * > - * Slot sync worker also does not participate in it, see comments atop > - * 'struct bkend' in postmaster.c. > + * can detect it if we exit without cleaning up. > */ > - if (IsUnderPostmaster && !AmAutoVacuumLauncherProcess() && > - !AmLogicalSlotSyncWorkerProcess()) > + if (IsUnderPostmaster) > MarkPostmasterChildActive(); > > /* Decide which list should supply our PGPROC. */ > @@ -536,6 +531,9 @@ InitAuxiliaryProcess(void) > if (MyProc != NULL) > elog(ERROR, "you already exist"); > > + if (IsUnderPostmaster) > + MarkPostmasterChildActive(); > + > /* > * We use the ProcStructLock to protect assignment and releasing of > * AuxiliaryProcs entries. Probably worth, at some point soon, to have an InitProcessCommon() or such. Greetings, Andres Freund
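To sketch the state-table idea (purely illustrative; neither the struct nor the array below exists in postmaster.c), the open-coded signalling on each state transition could be collapsed into table rows plus one generic loop:

    /* Which child types get which signal when a given state is entered. */
    typedef struct StateTransitionAction
    {
        PMState     state;          /* state being entered */
        int         signal;         /* signal to deliver */
        uint32      target_mask;    /* bitmask over BackendType */
    } StateTransitionAction;

    static const StateTransitionAction state_actions[] =
    {
        {PM_WAIT_BACKENDS, SIGTERM,
         (1 << B_BACKEND) | (1 << B_BG_WORKER) | (1 << B_AUTOVAC_WORKER) |
         (1 << B_STARTUP) | (1 << B_WAL_RECEIVER) |
         (1 << B_WAL_SUMMARIZER) | (1 << B_SLOTSYNC_WORKER)},
        /* ... one row per transition, replacing the open-coded if-chains ... */
    };

    /* In PostmasterStateMachine(), after pmState is updated: */
    for (int i = 0; i < lengthof(state_actions); i++)
    {
        if (state_actions[i].state == pmState)
            SignalSomeChildren(state_actions[i].signal,
                               state_actions[i].target_mask);
    }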
On 04/09/2024 17:35, Andres Freund wrote:
> On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:
>> +Running the tests
>> +=================
>> +
>> +NOTE: You must have given the --enable-tap-tests argument to configure.
>> +
>> +Run
>> + make check
>> +or
>> + make installcheck
>> +You can use "make installcheck" if you previously did "make install".
>> +In that case, the code in the installation tree is tested. With
>> +"make check", a temporary installation tree is built from the current
>> +sources and then tested.
>> +
>> +Either way, this test initializes, starts, and stops a test Postgres
>> +cluster.
>> +
>> +See src/test/perl/README for more info about running these tests.
>
> Is it really useful to have such instructions all over the tree?

That's debatable, but I didn't want to go down that rabbit hole with this patch.

It's repetitive for sure. But there are small variations in which PG_TEST_EXTRA options you need, whether "make installcheck" runs against a running server or still creates a temporary cluster, etc. I tried to deduplicate those instructions by moving the above boilerplate to src/test/README, and only noting the variations in the subdirectory READMEs. I didn't like the result. It's very helpful to have full copy-pasteable commands with all the right "PG_TEST_EXTRA" options for each test.

These instructions also don't mention how to run the tests with Meson. The first time I wanted to run individual tests with Meson, it took me a while to figure it out.

I'll think a little more about how to improve these READMEs, but let's take that to a separate thread.

>> From 93b9e9b6e072f63af9009e0d66ab6d0d62ea8c15 Mon Sep 17 00:00:00 2001
>> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
>> Date: Mon, 12 Aug 2024 10:55:11 +0300
>> Subject: [PATCH v4 2/8] Add test for dead-end backends
>>
>> The code path for launching a dead-end backend because we're out of
>> slots was not covered by any tests, so add one. (Some tests did hit
>> the case of launching a dead-end backend because the server is still
>> starting up, though, so the gap in our test coverage wasn't as big as
>> it sounds.)
>> ---
>> src/test/perl/PostgreSQL/Test/Cluster.pm | 39 +++++++++++++++++++
>> .../postmaster/t/001_connection_limits.pl | 17 +++++++-
>> 2 files changed, 55 insertions(+), 1 deletion(-)
>
> Why does this need to use "raw" connections? Can't you just create a bunch of connections with BackgroundPsql?

No, these need to be connections that haven't sent the startup packet yet.

With Andrew's PqFFI work [1], we could do better. The latest version on that thread doesn't expose the async functions like PQconnectStart() and PQconnectPoll() though, but they can be added.

[1] https://www.postgresql.org/message-id/97d1d1b9-d147-f69d-1991-d8794efed41c%40dunslane.net

Unless you have comments on these first two patches, which just add tests, I'll commit them shortly. Still processing the rest of your comments...

--
Heikki Linnakangas
Neon (https://neon.tech)
On 04/09/2024 17:35, Andres Freund wrote: > On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote: >> From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001 >> From: Heikki Linnakangas <heikki.linnakangas@iki.fi> >> Date: Mon, 12 Aug 2024 10:59:04 +0300 >> Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children >> >> And replace postmaster.c's own "backend type" codes with BackendType > > Hm - it seems a bit odd to open-code this when we actually have a "table > driven configuration" available? Why isn't the type a field in > child_process_kind? Sorry, I didn't understand this. What exactly would you add to child_process_kind? Where would you use it? >> +/* >> + * MaxLivePostmasterChildren >> + * >> + * This reports the number postmaster child processes that can be active. It >> + * includes all children except for dead_end children. This allows the array >> + * in shared memory (PMChildFlags) to have a fixed maximum size. >> + */ >> +int >> +MaxLivePostmasterChildren(void) >> +{ >> + int n = 0; >> + >> + /* We know exactly how mamy worker and aux processes can be active */ >> + n += autovacuum_max_workers; >> + n += max_worker_processes; >> + n += NUM_AUXILIARY_PROCS; >> + >> + /* >> + * We allow more connections here than we can have backends because some >> + * might still be authenticating; they might fail auth, or some existing >> + * backend might exit before the auth cycle is completed. The exact >> + * MaxBackends limit is enforced when a new backend tries to join the >> + * shared-inval backend array. >> + */ >> + n += 2 * (MaxConnections + max_wal_senders); >> + >> + return n; >> +} > > I wonder if we could instead maintain at least some of this in > child_process_kinds? Manually listing different types of processes in > different places doesn't seem particularly sustainable. Hmm, you mean adding "max this kind of children" field to child_process_kinds? Perhaps. > >> +/* >> + * Initialize at postmaster startup >> + */ >> +void >> +InitPostmasterChildSlots(void) >> +{ >> + int num_pmchild_slots; >> + int slotno; >> + PMChild *slots; >> + >> + dlist_init(&freeBackendList); >> + dlist_init(&freeAutoVacWorkerList); >> + dlist_init(&freeBgWorkerList); >> + dlist_init(&freeAuxList); >> + dlist_init(&ActiveChildList); >> + >> + num_pmchild_slots = MaxLivePostmasterChildren(); >> + >> + slots = palloc(num_pmchild_slots * sizeof(PMChild)); >> + >> + slotno = 0; >> + for (int i = 0; i < 2 * (MaxConnections + max_wal_senders); i++) >> + { >> + init_slot(&slots[slotno], slotno, &freeBackendList); >> + slotno++; >> + } >> + for (int i = 0; i < autovacuum_max_workers; i++) >> + { >> + init_slot(&slots[slotno], slotno, &freeAutoVacWorkerList); >> + slotno++; >> + } >> + for (int i = 0; i < max_worker_processes; i++) >> + { >> + init_slot(&slots[slotno], slotno, &freeBgWorkerList); >> + slotno++; >> + } >> + for (int i = 0; i < NUM_AUXILIARY_PROCS; i++) >> + { >> + init_slot(&slots[slotno], slotno, &freeAuxList); >> + slotno++; >> + } >> + Assert(slotno == num_pmchild_slots); >> +} > > Along the same vein - could we generalize this into one array of different > slot types and then loop over that to initialize / acquire the slots? Makes sense. 
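A possible shape for that, combining a "max this kind of children" entry with the generalized initialization loop (everything below is a sketch; the table, field names and helper functions are assumptions layered on top of the quoted v4 code, not existing source):

    typedef struct PMChildPool
    {
        BackendType kind;                   /* representative type for the pool */
        int         (*max_children) (void); /* capacity, derived from GUCs */
        dlist_head  freelist;               /* initialized at startup */
    } PMChildPool;

    static int pool_backends(void)  { return 2 * (MaxConnections + max_wal_senders); }
    static int pool_autovac(void)   { return autovacuum_max_workers; }
    static int pool_bgworkers(void) { return max_worker_processes; }
    static int pool_aux(void)       { return NUM_AUXILIARY_PROCS; }

    static PMChildPool child_pools[] =
    {
        {B_BACKEND, pool_backends},
        {B_AUTOVAC_WORKER, pool_autovac},
        {B_BG_WORKER, pool_bgworkers},
        {B_CHECKPOINTER, pool_aux},         /* stands in for all aux kinds */
    };

    /* Both the total and the per-pool setup fall out of the same table. */
    int
    MaxLivePostmasterChildren(void)
    {
        int         n = 0;

        for (int i = 0; i < lengthof(child_pools); i++)
            n += child_pools[i].max_children();
        return n;
    }

    void
    InitPostmasterChildSlots(void)
    {
        int         slotno = 0;
        PMChild    *slots = palloc(MaxLivePostmasterChildren() * sizeof(PMChild));

        for (int i = 0; i < lengthof(child_pools); i++)
        {
            dlist_init(&child_pools[i].freelist);
            for (int j = 0; j < child_pools[i].max_children(); j++)
            {
                init_slot(&slots[slotno], slotno, &child_pools[i].freelist);
                slotno++;
            }
        }
    }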
>> +/* Return the appropriate free-list for the given backend type */ >> +static dlist_head * >> +GetFreeList(BackendType btype) >> +{ >> + switch (btype) >> + { >> + case B_BACKEND: >> + case B_BG_WORKER: >> + case B_WAL_SENDER: >> + case B_SLOTSYNC_WORKER: >> + return &freeBackendList; > > Maybe a daft question - but why are all of these in the same list? Sure, > they're essentially all full backends, but they're covered by different GUCs? No reason. No particular reason they should *not* share the same list either though. > >> + /* >> + * Auxiliary processes. There can be only one of each of these >> + * running at a time. >> + */ >> + case B_AUTOVAC_LAUNCHER: >> + case B_ARCHIVER: >> + case B_BG_WRITER: >> + case B_CHECKPOINTER: >> + case B_STARTUP: >> + case B_WAL_RECEIVER: >> + case B_WAL_SUMMARIZER: >> + case B_WAL_WRITER: >> + return &freeAuxList; >> + >> + /* >> + * Logger is not connected to shared memory, and does not have a >> + * PGPROC entry, but we still allocate a child slot for it. >> + */ > > Tangential: Why do we need a freelist for these and why do we choose a random > pgproc for these instead of assigning one statically? > > Background: I'd like to not provide AIO workers with "bounce buffers" (for IO > of buffers that can't be done in-place, like writes when checksums are > enabled). The varying proc numbers make that harder than it'd have to be... Yeah, we can make these fixed.Currently, the # of slots reserved for aux processes is sized by NUM_AUXILIARY_PROCS, which is one smaller than the number of different aux proces kinds: > /* > * We set aside some extra PGPROC structures for auxiliary processes, > * ie things that aren't full-fledged backends but need shmem access. > * > * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver > * run during normal operation. Startup process and WAL receiver also consume > * 2 slots, but WAL writer is launched only after startup has exited, so we > * only need 6 slots. > */ > #define NUM_AUXILIARY_PROCS 6 For PMChildSlot numbers, we could certainly just allocate one more slot. It would probably make sense for PGPROCs too, even though PGPROC is a much larger struct. >> +PMChild * >> +AssignPostmasterChildSlot(BackendType btype) >> +{ >> + dlist_head *freelist; >> + PMChild *pmchild; >> + >> + freelist = GetFreeList(btype); >> + >> + if (dlist_is_empty(freelist)) >> + return NULL; >> + >> + pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist)); >> + pmchild->pid = 0; >> + pmchild->bkend_type = btype; >> + pmchild->rw = NULL; >> + pmchild->bgworker_notify = true; >> + >> + /* >> + * pmchild->child_slot for each entry was initialized when the array of >> + * slots was allocated. >> + */ >> + >> + dlist_push_head(&ActiveChildList, &pmchild->elem); >> + >> + ReservePostmasterChildSlot(pmchild->child_slot); >> + >> + /* FIXME: find a more elegant way to pass this */ >> + MyPMChildSlot = pmchild->child_slot; > > What if we assigned one offset for each process and assigned its ID here and > also used that for its ProcNumber - that way we wouldn't need to manage > freelists in two places. It's currently possible to have up to 2 * max_connections backends in the authentication phase. We would have to change that behaviour, or make the PGPROC array 2x larger. It might well be worth it, I don't know how sensible the current behaviour is. But I'd like to punt that to later patch, to keep the scope of this patch set reasonable. It's pretty straightforward to do later on top of this if we want to. 
>> +PMChild * >> +FindPostmasterChildByPid(int pid) >> +{ >> + dlist_iter iter; >> + >> + dlist_foreach(iter, &ActiveChildList) >> + { >> + PMChild *bp = dlist_container(PMChild, elem, iter.cur); >> + >> + if (bp->pid == pid) >> + return bp; >> + } >> + return NULL; >> +} > > It's not new, but it's quite sad that postmaster's process exit handling is > effectively O(Backends^2)... It would be straightforward to turn ActiveChildList into a hash table. But I'd like to leave that to a followup patch too. >> /* >> * We're ready to rock and roll... >> */ >> - StartupPID = StartChildProcess(B_STARTUP); >> - Assert(StartupPID != 0); >> + StartupPMChild = StartChildProcess(B_STARTUP); >> + Assert(StartupPMChild != NULL); > > This (not new) assertion is ... odd. Yeah, it's an assertion because StartChildProcess has this: > /* > * fork failure is fatal during startup, but there's no need to choke > * immediately if starting other child types fails. > */ > if (type == B_STARTUP) > ExitPostmaster(1); >> @@ -1961,26 +1915,6 @@ process_pm_reload_request(void) >> (errmsg("received SIGHUP, reloading configuration files"))); >> ProcessConfigFile(PGC_SIGHUP); >> SignalSomeChildren(SIGHUP, BACKEND_TYPE_ALL & ~(1 << B_DEAD_END_BACKEND)); >> - if (StartupPID != 0) >> - signal_child(StartupPID, SIGHUP); >> - if (BgWriterPID != 0) >> - signal_child(BgWriterPID, SIGHUP); >> - if (CheckpointerPID != 0) >> - signal_child(CheckpointerPID, SIGHUP); >> - if (WalWriterPID != 0) >> - signal_child(WalWriterPID, SIGHUP); >> - if (WalReceiverPID != 0) >> - signal_child(WalReceiverPID, SIGHUP); >> - if (WalSummarizerPID != 0) >> - signal_child(WalSummarizerPID, SIGHUP); >> - if (AutoVacPID != 0) >> - signal_child(AutoVacPID, SIGHUP); >> - if (PgArchPID != 0) >> - signal_child(PgArchPID, SIGHUP); >> - if (SysLoggerPID != 0) >> - signal_child(SysLoggerPID, SIGHUP); >> - if (SlotSyncWorkerPID != 0) >> - signal_child(SlotSyncWorkerPID, SIGHUP); >> >> /* Reload authentication config files too */ >> if (!load_hba()) > > For a moment I wondered why this change was part of this commit - but I guess > we didn't have any of these in an array/list before this change... Correct. >> @@ -2469,11 +2410,15 @@ process_pm_child_exit(void) >> } >> >> /* Was it the system logger? If so, try to start a new one */ >> - if (pid == SysLoggerPID) >> + if (SysLoggerPMChild && pid == SysLoggerPMChild->pid) >> { >> - SysLoggerPID = 0; >> /* for safety's sake, launch new logger *first* */ >> - SysLoggerPID = SysLogger_Start(); >> + SysLoggerPMChild->pid = SysLogger_Start(); >> + if (SysLoggerPMChild->pid == 0) >> + { >> + FreePostmasterChildSlot(SysLoggerPMChild); >> + SysLoggerPMChild = NULL; >> + } >> if (!EXIT_STATUS_0(exitstatus)) >> LogChildExit(LOG, _("system logger process"), > > Seems a bit weird to have one place with a different memory lifetime handling > than other places. Why don't we just do this the same way as in other places > but continue to defer the logging until after we tried to start the new > logger? Hmm, you mean let LaunchMissingBackgroundProcesses() handle the restart? I'm a little scared of changing the existing logic. We don't have a mechanism for deferring logging, so we would have to invent that, or the logs would just accumulate in the pipe until syslogger starts up. There's some code between here and LaunchMissingBackgroundProcesses(), so might postmaster get blocked between writing to the syslogger pipe, before having restarted it? If forking the syslogger process fails, that can happen anyway, though. 
> Might be worth having a test ensuring that loggers restart OK. Yeah.. >> /* Construct a process name for log message */ >> + >> + /* >> + * FIXME: use GetBackendTypeDesc here? How does the localization of that >> + * work? >> + */ >> if (bp->bkend_type == B_DEAD_END_BACKEND) >> { >> procname = _("dead end backend"); > > Might be worth having a version of GetBackendTypeDesc() that returns a > translated string? Constructing the string for background workers is a little more complicated: snprintf(namebuf, MAXPGPATH, _("background worker \"%s\""), bp->rw->rw_worker.bgw_type); We could still do that for background workers and use the transalated variant of GetBackendTypeDesc() for everything else though. >> @@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char *procname) >> { >> dlist_iter iter; >> >> - dlist_foreach(iter, &BackendList) >> + dlist_foreach(iter, &ActiveChildList) >> { >> - Backend *bp = dlist_container(Backend, elem, iter.cur); >> + PMChild *bp = dlist_container(PMChild, elem, iter.cur); >> + >> + /* We do NOT restart the syslogger */ >> + if (bp == SysLoggerPMChild) >> + continue; > > That comment seems a bit misleading - we do restart syslogger, we just don't > do it here, no? I realize it's an old comment, but it still seems like it's > worth fixing given that you touch all the code here... No, we really do not restart the syslogger. This code runs when *another* process has crashed unexpectedly. We kill all other processes, reinitialize shared memory and restart, but the old syslogger keeps running through all that. I'll add a note on that to InitPostmasterChildSlots(), as it's a bit surprising. >> @@ -2871,29 +2786,27 @@ PostmasterStateMachine(void) >> <snip> >> - if (WalSummarizerPID != 0) >> - signal_child(WalSummarizerPID, SIGTERM); >> - if (SlotSyncWorkerPID != 0) >> - signal_child(SlotSyncWorkerPID, SIGTERM); >> + targetMask |= (1 << B_STARTUP); >> + targetMask |= (1 << B_WAL_RECEIVER); >> + >> + targetMask |= (1 << B_WAL_SUMMARIZER); >> + targetMask |= (1 << B_SLOTSYNC_WORKER); >> /* checkpointer, archiver, stats, and syslogger may continue for now */ >> >> + SignalSomeChildren(SIGTERM, targetMask); >> + >> /* Now transition to PM_WAIT_BACKENDS state to wait for them to die */ >> pmState = PM_WAIT_BACKENDS; >> <snip> > > It's likely the right thing to not do as one patch, but IMO this really wants > to be a state table. Perhaps as part of child_process_kinds, perhaps separate > from that. Yeah. I've tried to refactor this into a table before, but didn't come up with anything that I was happy with. I also feel there must be a better way to organize this, but not sure what exactly. I hope that will become more apparent after these other changes. >> @@ -3130,8 +3047,21 @@ static void >> LaunchMissingBackgroundProcesses(void) >> { >> /* Syslogger is active in all states */ >> - if (SysLoggerPID == 0 && Logging_collector) >> - SysLoggerPID = SysLogger_Start(); >> + if (SysLoggerPMChild == NULL && Logging_collector) >> + { >> + SysLoggerPMChild = AssignPostmasterChildSlot(B_LOGGER); >> + if (!SysLoggerPMChild) >> + elog(LOG, "no postmaster child slot available for syslogger"); > > How could this elog() be reached? Seems something seriously would have gone > wrong to get here - in which case a LOG that might not even be visible (due to > logger not working) doesn't seem like the right response. I'll turn it into an assertion or PANIC. 
>> @@ -3334,29 +3270,12 @@ SignalSomeChildren(int signal, uint32 targetMask) >> static void >> TerminateChildren(int signal) >> { > > The comment for TerminateChildren() says "except syslogger and dead_end > backends." - aren't you including the latter here? The comment is adjusted in v4-0004-Introduce-a-separate-BackendType-for-dead-end-chi.patch. Before that, SignalChildren() does ignore dead-end children. Thanks for the review! -- Heikki Linnakangas Neon (https://neon.tech)
On Fri, Sep 6, 2024 at 9:13 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > It's currently possible to have up to 2 * max_connections backends in > the authentication phase. We would have to change that behaviour, or > make the PGPROC array 2x larger. I know I already said this elsewhere, but in case it got lost in the shuffle, +1 for changing this, unless somebody can make a compelling argument why 2 * max_connections isn't WAY too many. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2024-09-06 16:13:43 +0300, Heikki Linnakangas wrote: > On 04/09/2024 17:35, Andres Freund wrote: > > On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote: > > > From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001 > > > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > > > Date: Mon, 12 Aug 2024 10:59:04 +0300 > > > Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children > > > > > > And replace postmaster.c's own "backend type" codes with BackendType > > > > Hm - it seems a bit odd to open-code this when we actually have a "table > > driven configuration" available? Why isn't the type a field in > > child_process_kind? > > Sorry, I didn't understand this. What exactly would you add to > child_process_kind? Where would you use it? I'm not entirely sure what I was thinking of. It might be partially triggering a prior complaint I had about manually assigning things to MyBackendType, despite actually having all the information already. One thing that I just noticed is that this patch orphans comment references to BACKEND_TYPE_AUTOVAC and BACKEND_TYPE_BGWORKER. Seems a tad odd to have BACKEND_TYPE_ALL after removing everything else from the BACKEND_TYPE_* "namespace". To deal with the issue around bitmasks you had mentioned, I think we should at least have a static inline function to convert B_* values to the bitmask index. > > > +/* > > > + * MaxLivePostmasterChildren > > > + * > > > + * This reports the number postmaster child processes that can be active. It > > > + * includes all children except for dead_end children. This allows the array > > > + * in shared memory (PMChildFlags) to have a fixed maximum size. > > > + */ > > > +int > > > +MaxLivePostmasterChildren(void) > > > +{ > > > + int n = 0; > > > + > > > + /* We know exactly how mamy worker and aux processes can be active */ > > > + n += autovacuum_max_workers; > > > + n += max_worker_processes; > > > + n += NUM_AUXILIARY_PROCS; > > > + > > > + /* > > > + * We allow more connections here than we can have backends because some > > > + * might still be authenticating; they might fail auth, or some existing > > > + * backend might exit before the auth cycle is completed. The exact > > > + * MaxBackends limit is enforced when a new backend tries to join the > > > + * shared-inval backend array. > > > + */ > > > + n += 2 * (MaxConnections + max_wal_senders); > > > + > > > + return n; > > > +} > > > > I wonder if we could instead maintain at least some of this in > > child_process_kinds? Manually listing different types of processes in > > different places doesn't seem particularly sustainable. > > Hmm, you mean adding "max this kind of children" field to > child_process_kinds? Perhaps. Yep, that's what I meant. > > > +/* Return the appropriate free-list for the given backend type */ > > > +static dlist_head * > > > +GetFreeList(BackendType btype) > > > +{ > > > + switch (btype) > > > + { > > > + case B_BACKEND: > > > + case B_BG_WORKER: > > > + case B_WAL_SENDER: > > > + case B_SLOTSYNC_WORKER: > > > + return &freeBackendList; > > > > Maybe a daft question - but why are all of these in the same list? Sure, > > they're essentially all full backends, but they're covered by different GUCs? > > No reason. No particular reason they should *not* share the same list either > though. Aren't they controlled by distinct connection limits? Isn't there a danger that we could use up entries and fail connections due to that, despite not actually being above the limit? 
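Returning to the inline-conversion helper suggested a few paragraphs up, a minimal version (name illustrative) would be:

    /*
     * Call sites then write btmask(B_BACKEND) | btmask(B_WAL_SENDER) and
     * can never forget the shift again.
     */
    static inline uint32
    btmask(BackendType t)
    {
        return (uint32) 1 << t;
    }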
> > Tangential: Why do we need a freelist for these and why do we choose a random > > pgproc for these instead of assigning one statically? > > > > Background: I'd like to not provide AIO workers with "bounce buffers" (for IO > > of buffers that can't be done in-place, like writes when checksums are > > enabled). The varying proc numbers make that harder than it'd have to be... > > Yeah, we can make these fixed. Cool. > Currently, the # of slots reserved for aux processes is sized by > NUM_AUXILIARY_PROCS, which is one smaller than the number of different aux > proces kinds: > > /* > > * We set aside some extra PGPROC structures for auxiliary processes, > > * ie things that aren't full-fledged backends but need shmem access. > > * > > * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver > > * run during normal operation. Startup process and WAL receiver also consume > > * 2 slots, but WAL writer is launched only after startup has exited, so we > > * only need 6 slots. > > */ > > #define NUM_AUXILIARY_PROCS 6 > > For PMChildSlot numbers, we could certainly just allocate one more slot. > > It would probably make sense for PGPROCs too, even though PGPROC is a much > larger struct. I don't think it's worth worrying about that much. PGPROC is large, but not *that* large. And the robustness win of properly detecting when there's a problem around starting/stopping aux workers seems to outweigh that to me. > > > +PMChild * > > > +AssignPostmasterChildSlot(BackendType btype) > > > +{ > > > + dlist_head *freelist; > > > + PMChild *pmchild; > > > + > > > + freelist = GetFreeList(btype); > > > + > > > + if (dlist_is_empty(freelist)) > > > + return NULL; > > > + > > > + pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist)); > > > + pmchild->pid = 0; > > > + pmchild->bkend_type = btype; > > > + pmchild->rw = NULL; > > > + pmchild->bgworker_notify = true; > > > + > > > + /* > > > + * pmchild->child_slot for each entry was initialized when the array of > > > + * slots was allocated. > > > + */ > > > + > > > + dlist_push_head(&ActiveChildList, &pmchild->elem); > > > + > > > + ReservePostmasterChildSlot(pmchild->child_slot); > > > + > > > + /* FIXME: find a more elegant way to pass this */ > > > + MyPMChildSlot = pmchild->child_slot; > > > > What if we assigned one offset for each process and assigned its ID here and > > also used that for its ProcNumber - that way we wouldn't need to manage > > freelists in two places. > > It's currently possible to have up to 2 * max_connections backends in the > authentication phase. We would have to change that behaviour, or make the > PGPROC array 2x larger. That however, might be too much... > It might well be worth it, I don't know how sensible the current behaviour > is. But I'd like to punt that to later patch, to keep the scope of this > patch set reasonable. It's pretty straightforward to do later on top of this > if we want to. Makes sense. I still think that we'd be better off to just return an error to the client in postmaster, rather than deal with this dead-end children mess. That was perhaps justified at some point, but now it seems to add way more complexity than it's worth. And it's absurdly expensive to fork to return an error. Way more expensive than just having postmaster send an error and close the socket. > > > @@ -2469,11 +2410,15 @@ process_pm_child_exit(void) > > > } > > > /* Was it the system logger? 
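A standalone sketch of the direction being agreed on above, i.e. keeping a per-kind maximum in the child_process_kinds table and summing it up instead of listing the process kinds again in MaxLivePostmasterChildren(); the struct, field names and hard-coded counts below are placeholders, not the real table in launch_backend.c:

#include <stddef.h>

typedef struct ChildProcessKindInfo
{
    const char *name;
    int         max_active;     /* how many of this kind can run at once */
} ChildProcessKindInfo;

/* the counts stand in for the GUC-derived limits named in the comments */
static const ChildProcessKindInfo child_process_kinds[] = {
    {"autovacuum worker", 3},               /* autovacuum_max_workers */
    {"background worker", 8},               /* max_worker_processes */
    {"aux process", 6},                     /* NUM_AUXILIARY_PROCS */
    {"client backend / walsender", 220},    /* 2 * (MaxConnections + max_wal_senders) */
};

static int
MaxLivePostmasterChildren(void)
{
    int         n = 0;

    for (size_t i = 0; i < sizeof(child_process_kinds) / sizeof(child_process_kinds[0]); i++)
        n += child_process_kinds[i].max_active;
    return n;
}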
If so, try to start a new one */ > > > - if (pid == SysLoggerPID) > > > + if (SysLoggerPMChild && pid == SysLoggerPMChild->pid) > > > { > > > - SysLoggerPID = 0; > > > /* for safety's sake, launch new logger *first* */ > > > - SysLoggerPID = SysLogger_Start(); > > > + SysLoggerPMChild->pid = SysLogger_Start(); > > > + if (SysLoggerPMChild->pid == 0) > > > + { > > > + FreePostmasterChildSlot(SysLoggerPMChild); > > > + SysLoggerPMChild = NULL; > > > + } > > > if (!EXIT_STATUS_0(exitstatus)) > > > LogChildExit(LOG, _("system logger process"), > > > > Seems a bit weird to have one place with a different memory lifetime handling > > than other places. Why don't we just do this the same way as in other places > > but continue to defer the logging until after we tried to start the new > > logger? > > Hmm, you mean let LaunchMissingBackgroundProcesses() handle the restart? Yea - which it already can do, presumably to handle the case of logging_collector. It just seems odd to have three places calling SysLogger_Start() - with some mild variations of the code. Perhaps we can at least centralize some of that? But you have a point with: > I'm a little scared of changing the existing logic. We don't have a > mechanism for deferring logging, so we would have to invent that, or the > logs would just accumulate in the pipe until syslogger starts up. There's > some code between here and LaunchMissingBackgroundProcesses(), so might > postmaster get blocked between writing to the syslogger pipe, before having > restarted it? > > If forking the syslogger process fails, that can happen anyway, though. > > > /* Construct a process name for log message */ > > > + > > > + /* > > > + * FIXME: use GetBackendTypeDesc here? How does the localization of that > > > + * work? > > > + */ > > > if (bp->bkend_type == B_DEAD_END_BACKEND) > > > { > > > procname = _("dead end backend"); > > > > Might be worth having a version of GetBackendTypeDesc() that returns a > > translated string? > > Constructing the string for background workers is a little more complicated: Random aside: I *hate* that there's no trivial way to recognize background workers in pg_stat_activity, because somebody made pg_stat_activity.backend_type report something completely under control of extensions... > > > @@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char *procname) > > > { > > > dlist_iter iter; > > > - dlist_foreach(iter, &BackendList) > > > + dlist_foreach(iter, &ActiveChildList) > > > { > > > - Backend *bp = dlist_container(Backend, elem, iter.cur); > > > + PMChild *bp = dlist_container(PMChild, elem, iter.cur); > > > + > > > + /* We do NOT restart the syslogger */ > > > + if (bp == SysLoggerPMChild) > > > + continue; > > > > That comment seems a bit misleading - we do restart syslogger, we just don't > > do it here, no? I realize it's an old comment, but it still seems like it's > > worth fixing given that you touch all the code here... > > No, we really do not restart the syslogger. Hm? /* Was it the system logger? 
If so, try to start a new one */ if (SysLoggerPMChild && pid == SysLoggerPMChild->pid) { /* for safety's sake, launch new logger *first* */ SysLoggerPMChild->pid = SysLogger_Start(SysLoggerPMChild->child_slot); if (SysLoggerPMChild->pid == 0) { FreePostmasterChildSlot(SysLoggerPMChild); SysLoggerPMChild = NULL; } if (!EXIT_STATUS_0(exitstatus)) LogChildExit(LOG, _("system logger process"), pid, exitstatus); continue; } We don't do it in reaction to other processes crashing, but we still restart it if it dies. Perhaps it's clear from context - but I had to think about it for a moment. > > > > @@ -2871,29 +2786,27 @@ PostmasterStateMachine(void) > > > <snip> > > > - if (WalSummarizerPID != 0) > > > - signal_child(WalSummarizerPID, SIGTERM); > > > - if (SlotSyncWorkerPID != 0) > > > - signal_child(SlotSyncWorkerPID, SIGTERM); > > > + targetMask |= (1 << B_STARTUP); > > > + targetMask |= (1 << B_WAL_RECEIVER); > > > + > > > + targetMask |= (1 << B_WAL_SUMMARIZER); > > > + targetMask |= (1 << B_SLOTSYNC_WORKER); > > > /* checkpointer, archiver, stats, and syslogger may continue for now */ > > > + SignalSomeChildren(SIGTERM, targetMask); > > > + > > > /* Now transition to PM_WAIT_BACKENDS state to wait for them to die */ > > > pmState = PM_WAIT_BACKENDS; > > > <snip> > > > > It's likely the right thing to not do as one patch, but IMO this really wants > > to be a state table. Perhaps as part of child_process_kinds, perhaps separate > > from that. > > Yeah. I've tried to refactor this into a table before, but didn't come up > with anything that I was happy with. I also feel there must be a better way > to organize this, but not sure what exactly. I hope that will become more > apparent after these other changes. What I'm imagining is something like: 1) Make PMState values each have a distinct bit 2) Move PMState to some (new?) header 3) Add a "uint32 should_run" member to child_process_kind that's a bitmask of PMStates 4) Add a new function in launch_backend.c that gets passed the "target" PMState and returns a bitmask of the tasks that should be running (or the inverse, doesn't really matter). 5) Instead of open-coding the targetMask "computation", use the new function from 4). I think that might not look too bad? Greetings, Andres Freund
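Spelling out those five steps as a standalone sketch, just to show the mechanics being described; the bit assignments, the two example rows and all helper names here are invented for illustration, not taken from any patch:

#include <stdint.h>

/* 1) give each postmaster state its own bit */
typedef enum PMStateBit
{
    PM_RUN           = 1 << 0,
    PM_STOP_BACKENDS = 1 << 1,
    PM_WAIT_BACKENDS = 1 << 2,
    PM_WAIT_DEAD_END = 1 << 3
} PMStateBit;

/* 3) each child kind records the states in which it is allowed to run */
typedef struct ChildKindEntry
{
    const char *name;
    uint32_t    should_run;     /* bitmask of PMStateBit values */
} ChildKindEntry;

static const ChildKindEntry child_kinds[] = {
    /* regular backends are signalled as soon as we start stopping backends */
    {"client backend", PM_RUN},
    /* checkpointer keeps running until much later in the shutdown sequence */
    {"checkpointer", PM_RUN | PM_STOP_BACKENDS | PM_WAIT_BACKENDS | PM_WAIT_DEAD_END},
};

/* 4) which kinds should still be running in the given state? */
static uint32_t
kinds_running_in(PMStateBit state)
{
    uint32_t    mask = 0;

    for (int i = 0; i < (int) (sizeof(child_kinds) / sizeof(child_kinds[0])); i++)
    {
        if (child_kinds[i].should_run & state)
            mask |= (uint32_t) 1 << i;
    }
    return mask;
}

/* 5) the state machine would then signal every kind NOT in kinds_running_in(newState) */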
On Tue, Sep 10, 2024 at 12:59 PM Andres Freund <andres@anarazel.de> wrote: > I still think that we'd be better off to just return an error to the client in > postmaster, rather than deal with this dead-end children mess. That was > perhaps justified at some point, but now it seems to add way more complexity > than it's worth. And it's absurdly expensive to fork to return an error. Way > more expensive than just having postmaster send an error and close the socket. The tricky case is the one where the client write() -- or SSL_write() -- blocks. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote: > @@ -2864,6 +2777,8 @@ PostmasterStateMachine(void) > */ > if (pmState == PM_STOP_BACKENDS) > { > + uint32 targetMask; > + > /* > * Forget any pending requests for background workers, since we're no > * longer willing to launch any new workers. (If additional requests > @@ -2871,29 +2786,27 @@ PostmasterStateMachine(void) > */ > ForgetUnstartedBackgroundWorkers(); > > - /* Signal all backend children except walsenders and dead-end backends */ > - SignalSomeChildren(SIGTERM, > - BACKEND_TYPE_ALL & ~(1 << B_WAL_SENDER | 1 << B_DEAD_END_BACKEND)); > + /* Signal all backend children except walsenders */ > + /* dead-end children are not signalled yet */ > + targetMask = (1 << B_BACKEND); > + targetMask |= (1 << B_BG_WORKER); > + > /* and the autovac launcher too */ > - if (AutoVacPID != 0) > - signal_child(AutoVacPID, SIGTERM); > + targetMask |= (1 << B_AUTOVAC_LAUNCHER); > /* and the bgwriter too */ > - if (BgWriterPID != 0) > - signal_child(BgWriterPID, SIGTERM); > + targetMask |= (1 << B_BG_WRITER); > /* and the walwriter too */ > - if (WalWriterPID != 0) > - signal_child(WalWriterPID, SIGTERM); > + targetMask |= (1 << B_WAL_WRITER); > /* If we're in recovery, also stop startup and walreceiver procs */ > - if (StartupPID != 0) > - signal_child(StartupPID, SIGTERM); > - if (WalReceiverPID != 0) > - signal_child(WalReceiverPID, SIGTERM); > - if (WalSummarizerPID != 0) > - signal_child(WalSummarizerPID, SIGTERM); > - if (SlotSyncWorkerPID != 0) > - signal_child(SlotSyncWorkerPID, SIGTERM); > + targetMask |= (1 << B_STARTUP); > + targetMask |= (1 << B_WAL_RECEIVER); > + > + targetMask |= (1 << B_WAL_SUMMARIZER); > + targetMask |= (1 << B_SLOTSYNC_WORKER); > /* checkpointer, archiver, stats, and syslogger may continue for now */ > > + SignalSomeChildren(SIGTERM, targetMask); > + > /* Now transition to PM_WAIT_BACKENDS state to wait for them to die */ > pmState = PM_WAIT_BACKENDS; > } I think this might now omit shutting down at least autovac workers, which afaict previously were included. Greetings, Andres Freund
Hi, On 2024-09-10 13:33:36 -0400, Robert Haas wrote: > On Tue, Sep 10, 2024 at 12:59 PM Andres Freund <andres@anarazel.de> wrote: > > I still think that we'd be better off to just return an error to the client in > > postmaster, rather than deal with this dead-end children mess. That was > > perhaps justified at some point, but now it seems to add way more complexity > > than it's worth. And it's absurdly expensive to fork to return an error. Way > > more expensive than just having postmaster send an error and close the socket. > > The tricky case is the one where the client write() -- or SSL_write() -- blocks. Yea, SSL definitely does make it harder. But it's not exactly rocket science to do non-blocking SSL connection establishment. After all, we do manage to do so in libpq... Greetings, Andres Freund
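To make the trade-off concrete, here is a rough standalone sketch of what "send the error without ever blocking" looks like with OpenSSL on a non-blocking socket; the function is invented for illustration and glosses over buffering details (it assumes the message fits in one SSL record, or that SSL_MODE_ENABLE_PARTIAL_WRITE is set):

#include <stdbool.h>
#include <poll.h>
#include <openssl/ssl.h>

/*
 * Sketch: try to push a short error message out on a non-blocking SSL
 * connection, waiting at most timeout_ms for the socket to become ready
 * instead of blocking indefinitely.  Returns true if everything was sent.
 */
static bool
send_error_nonblocking(SSL *ssl, int sock, const char *buf, int len, int timeout_ms)
{
    int         sent = 0;

    while (sent < len)
    {
        int         n = SSL_write(ssl, buf + sent, len - sent);

        if (n > 0)
        {
            sent += n;
            continue;
        }

        switch (SSL_get_error(ssl, n))
        {
            case SSL_ERROR_WANT_READ:   /* handshake traffic may need reads */
            case SSL_ERROR_WANT_WRITE:
                {
                    struct pollfd pfd = {.fd = sock, .events = POLLIN | POLLOUT};

                    if (poll(&pfd, 1, timeout_ms) <= 0)
                        return false;   /* timed out or poll failed: give up and close */
                    break;
                }
            default:
                return false;           /* any other SSL error: give up and close */
        }
    }
    return true;
}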
On Sat, Oct 5, 2024 at 7:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > My test for dead-end backends opens 20 TCP (or unix domain) connections > to the server, in quick succession. That works fine my system, and it > passed cirrus CI on other platforms, but on FreeBSD it failed > repeatedly. The behavior in that scenario is apparently > platform-dependent: it depends on the accept queue size, but what > happens when you reach the queue size also seems to depend on the > platform. On my Linux system, the connect() calls in the client are > blocked, if the server is doesn't call accept() fast enough, but > apparently you get an error on *BSD systems. Right, we've analysed that difference in AF_UNIX implementation before[1], which shows up in the real world, where client sockets ie libpq's are usually non-blocking, as EAGAIN on Linux (which is not valid per POSIX) vs ECONNREFUSED on other OSes. All fail to connect, but the error message is different. For blocking AF_UNIX client sockets like in your test, Linux effectively has an infinite queue made from two layers. The listen queue (a queue of connecting sockets) does respect the requested backlog size, but when it's full it has an extra trick: the connect() call waits (in a queue of threads) for space to become free in the listen queue, so it's effectively unlimited (but only for blocking sockets), while FreeBSD and I suspect any other implementation deriving from or reimplementing the BSD socket code gives you ECONNREFUSED. macOS behaves just the same as FreeBSD AFAICT, so I don't know why you didn't see the same thing... I guess it's just racing against accept() draining the queue. It's possible that Windows copied the Linux behaviour for AF_UNIX, given that it probably has something to do with the WSL project for emulating Linux, but IDK. [1] https://www.postgresql.org/message-id/flat/CADc_NKg2d%2BoZY9mg4DdQdoUcGzN2kOYXBu-3--RW_hEe0tUV%3Dg%40mail.gmail.com > I'm not sure of the exact details, but in any case, platform-dependent > behavior needs to be avoided in tests. So I changed the test so that it > sends an SSLRequest packet on each connection and waits for reply (which > is always 'N' to reject it in this test), before opening the next > connection. This way, each connection is still left hanging, which is > what I want in this test, but only after postmaster has successfully > accept()ed it and forked the backend. Makes sense.
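For reference, the synchronization trick described above is tiny on the wire; a bare-bones C sketch of one such probe over an already-connected socket (error handling omitted, function name is mine) might look like:

#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>

/*
 * Send an SSLRequest (int32 length 8, int32 magic 80877103) and wait for the
 * server's single-byte 'S'/'N' answer.  Receiving the reply proves the
 * postmaster has accept()ed this connection and handed it to a child, which
 * is the point the test wants to reach before opening the next connection.
 */
static char
probe_sslrequest(int sock)
{
    uint32_t    msg[2] = {htonl(8), htonl(80877103)};
    char        answer = 0;

    (void) write(sock, msg, sizeof(msg));
    (void) read(sock, &answer, 1);  /* 'N' is expected when SSL is disabled */
    return answer;
}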
On 05/10/2024 01:03, Thomas Munro wrote: > On Sat, Oct 5, 2024 at 7:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> My test for dead-end backends opens 20 TCP (or unix domain) connections >> to the server, in quick succession. That works fine my system, and it >> passed cirrus CI on other platforms, but on FreeBSD it failed >> repeatedly. The behavior in that scenario is apparently >> platform-dependent: it depends on the accept queue size, but what >> happens when you reach the queue size also seems to depend on the >> platform. On my Linux system, the connect() calls in the client are >> blocked, if the server is doesn't call accept() fast enough, but >> apparently you get an error on *BSD systems. > > Right, we've analysed that difference in AF_UNIX implementation > before[1], which shows up in the real world, where client sockets ie > libpq's are usually non-blocking, as EAGAIN on Linux (which is not > valid per POSIX) vs ECONNREFUSED on other OSes. All fail to connect, > but the error message is different. Thanks for the pointer! > For blocking AF_UNIX client sockets like in your test, Linux > effectively has an infinite queue made from two layers. The listen > queue (a queue of connecting sockets) does respect the requested > backlog size, but when it's full it has an extra trick: the connect() > call waits (in a queue of threads) for space to become free in the > listen queue, so it's effectively unlimited (but only for blocking > sockets), while FreeBSD and I suspect any other implementation > deriving from or reimplementing the BSD socket code gives you > ECONNREFUSED. macOS behaves just the same as FreeBSD AFAICT, so I > don't know why you didn't see the same thing... I guess it's just > racing against accept() draining the queue. In fact I misremembered: the failure happened on macOS, *not* FreeBSD. It could be just luck I didn't see it on FreeBSD though. > It's possible that Windows copied the Linux behaviour for AF_UNIX, > given that it probably has something to do with the WSL project for > emulating Linux, but IDK. Sadly Windows' IO::Socket::UNIX hasn't been implemented on Windows (or at least on this perl distribution we're using in Cirrus CI): Socket::pack_sockaddr_un not implemented on this architecture at C:/strawberry/5.26.3.1/perl/lib/Socket.pm line 872. so I'll have to disable this test on Windows anyway. -- Heikki Linnakangas Neon (https://neon.tech)
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On 05/10/2024 01:03, Thomas Munro wrote: > >> It's possible that Windows copied the Linux behaviour for AF_UNIX, >> given that it probably has something to do with the WSL project for >> emulating Linux, but IDK. > > Sadly Windows' IO::Socket::UNIX hasn't been implemented on Windows (or > at least on this perl distribution we're using in Cirrus CI): > > Socket::pack_sockaddr_un not implemented on this architecture at > C:/strawberry/5.26.3.1/perl/lib/Socket.pm line 872. > > so I'll have to disable this test on Windows anyway. Socket version 2.028 (included in Perl 5.32) provides pack_sockaddr_un() on Windows, so that can be fixed by bumping the Perl version in https://github.com/anarazel/pg-vm-images/blob/main/packer/windows.pkr.hcl to something more modern (such as 5.40.0.1), and only skipping the test on Windows if Socket is too old. The decision to use 5.26 seems to come from the initial creation of the CI images in 2021 (when 5.34 was current), with the comment «newer versions don't currently work correctly for plperl». That claim is worth revisiting, and fixing if it's still the case. - ilmari
Hi, On 2024-10-05 20:51:50 +0100, Dagfinn Ilmari Mannsåker wrote: > Socket version 2.028 (included in Perl 5.32) provides pack_sockaddr_un() > on Windows, so that can be fixed by bumping the Perl version in > https://github.com/anarazel/pg-vm-images/blob/main/packer/windows.pkr.hcl > to something more modern (such as 5.40.0.1), and only skipping the test > if on Windows if Socket is too old. > > The decision to use 5.26 seems to come from the initial creation of the > CI images in 2021 (when 5.34 was current), with the comment «newer > versions don't currently work correctly for plperl». That claim is > worth revisiting, and fixing if it's still the case. I think we fixed the issues that were known at the time. I think I tried upgrading to something newer at some point and there were some weird, but fixable, encoding issues. Unfortunately I don't have the bandwidth to tackle this rn. Greetings, Andres Freund
Hi, On Tue, 8 Oct 2024 at 00:55, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 05/10/2024 22:51, Dagfinn Ilmari Mannsåker wrote: > > Heikki Linnakangas <hlinnaka@iki.fi> writes: > >> Sadly Windows' IO::Socket::UNIX hasn't been implemented on Windows (or > >> at least on this perl distribution we're using in Cirrus CI): > >> > >> Socket::pack_sockaddr_un not implemented on this architecture at > >> C:/strawberry/5.26.3.1/perl/lib/Socket.pm line 872. > >> > >> so I'll have to disable this test on Windows anyway. > > > > Socket version 2.028 (included in Perl 5.32) provides pack_sockaddr_un() > > on Windows, so that can be fixed by bumping the Perl version in > > https://github.com/anarazel/pg-vm-images/blob/main/packer/windows.pkr.hcl > > to something more modern (such as 5.40.0.1), and only skipping the test > > if on Windows if Socket is too old. > > > > The decision to use 5.26 seems to come from the initial creation of the > > CI images in 2021 (when 5.34 was current), with the comment «newer > > versions don't currently work correctly for plperl». That claim is > > worth revisiting, and fixing if it's still the case. > > Yeah, it would be nice to update it. I wonder if commit > 341f4e002d461a3c5513cb864490cddae2b43a64 fixed whatever the problem was. The perl version in Windows CI image is bumped to 5.40.0.1 [1]. So, the related test passes on Windows now [2]. [1] https://github.com/anarazel/pg-vm-images/commit/cbd5d46f2fb7b28efb126ddac64d12711247dfa8 [2] https://cirrus-ci.com/task/5682393120505856?logs=check_world#L241 -- Regards, Nazir Bilal Yavuz Microsoft
On 09/10/2024 23:40, Heikki Linnakangas wrote: > I pushed the first three patches, with the new test and one of the small > refactoring patches. Thanks for all the comments so far! Here is a new > version of the remaining patches. > > Lots of little cleanups and changes here and there since the last > versions, but the notable bigger changes are: > > - There is now a BackendTypeMask datatype, so that if you try to mix up > bitmasks and plain BackendType values, the compiler will complain. > > - pmchild.c has been rewritten per feedback, so that the "pools" of > PMChild structs are more explicit. The size of each pool is only stated > once, whereas before the same logic was duplicated in > MaxLivePostmasterChildren() which calculates the number of slots and in > InitPostmasterChildSlots() which allocates them. > > - In PostmasterStateMachine(), I combined the code to handle > PM_STOP_BACKENDS and PM_WAIT_BACKENDS. They are essentially the same > state, except that PM_STOP_BACKENDS first sends the signal to all the > child processes that it will then wait for. They both needed to build > the same bitmask of processes to signal or wait for; this eliminates the > duplication. Made a few more changes since last patch version: - Fixed initialization in pmchild.c in single-user and bootstrapping mode - inlined assign_backendlist_entry() into its only caller; it wasn't doing much anymore - cleaned up some leftovers in canAcceptConnections() - Renamed some functions for clarity, fixed some leftover comments that still talked about Backend structs and BackendList With those changes, committed. Thanks for the review! -- Heikki Linnakangas Neon (https://neon.tech)
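The BackendTypeMask idea mentioned above amounts to wrapping the bitmask in a one-member struct so a plain BackendType can no longer be passed where a mask is expected; a condensed standalone sketch of that pattern (illustrative only, not the committed code) looks like:

#include <stdbool.h>
#include <stdint.h>

typedef struct BackendTypeMask
{
    uint32_t    mask;           /* one bit per backend type */
} BackendTypeMask;

static inline BackendTypeMask
btmask(int btype)               /* btype is a BackendType (B_*) value */
{
    BackendTypeMask m = {(uint32_t) 1 << btype};

    return m;
}

static inline BackendTypeMask
btmask_add(BackendTypeMask m, int btype)
{
    m.mask |= (uint32_t) 1 << btype;
    return m;
}

static inline bool
btmask_contains(BackendTypeMask m, int btype)
{
    return (m.mask & ((uint32_t) 1 << btype)) != 0;
}

/* passing a bare "1 << B_BACKEND" where a BackendTypeMask is expected now fails to compile */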
On 11/14/24 15:13, Heikki Linnakangas wrote: > On 09/10/2024 23:40, Heikki Linnakangas wrote: >> I pushed the first three patches, with the new test and one of the small >> refactoring patches. Thanks for all the comments so far! Here is a new >> version of the remaining patches. >> Hi, the TAP test 001_connection_limits.pl introduced by 6a1d0d470e84 seems to have problems with valgrind :-( I reliably get this failure: t/001_connection_limits.pl .. 3/? # Tests were run but no plan was declared and done_testing() was not seen. # Looks like your test exited with 29 just after 4. t/001_connection_limits.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00) All 4 subtests passed and tmp_check/log/regress_log_001_connection_limits says: [23:48:44.444](1.129s) ok 3 - reserved_connections limit [23:48:44.445](0.001s) ok 4 - reserved_connections limit: matches process ended prematurely at /home/user/work/postgres/src/test/postmaster/../../../src/test/perl/PostgreSQL/Test/BackgroundPsql.pm line 154. # Postmaster PID for node "primary" is 198592 That BackgroundPsql.pm line is this in wait_connect() $self->{run}->pump() until $self->{stdout} =~ /$banner/ || $self->{timeout}->is_expired; By trial and error I found that it fails on this line 70: push(@sessions, background_psql_as_user('regress_superuser')); but I have no idea why. There are multiple similar calls a couple lines earlier, and those work fine. And various other TAP tests with background_psql() work fine too. So what's so special about this particular line? regards -- Tomas Vondra
On 09/12/2024 01:12, Tomas Vondra wrote: > On 11/14/24 15:13, Heikki Linnakangas wrote: >> On 09/10/2024 23:40, Heikki Linnakangas wrote: >>> I pushed the first three patches, with the new test and one of the small >>> refactoring patches. Thanks for all the comments so far! Here is a new >>> version of the remaining patches. >>> > Hi, the TAP test 001_connection_limits.pl introduced by 6a1d0d470e84 > seems to have problems with valgrind :-( I reliably get this failure: How exactly do you run the test with valgrind? What platform? It works for me, with this: (cd build && ninja && rm -rf tmp_install && meson test --suite setup && valgrind --leak-check=no --gen-suppressions=all --suppressions=/home/heikki/git-sandbox/postgresql/src/tools/valgrind.supp --time-stamp=yes --error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END --log-file=$HOME/pg-valgrind/%p.log --trace-children=yes meson test --suite postmaster ) > t/001_connection_limits.pl .. 3/? # Tests were run but no plan was > declared and done_testing() was not seen. > # Looks like your test exited with 29 just after 4. > t/001_connection_limits.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00) > All 4 subtests passed > > > and tmp_check/log/regress_log_001_connection_limits says: > > > [23:48:44.444](1.129s) ok 3 - reserved_connections limit > [23:48:44.445](0.001s) ok 4 - reserved_connections limit: matches > process ended prematurely at > /home/user/work/postgres/src/test/postmaster/../../../src/test/perl/PostgreSQL/Test/BackgroundPsql.pm > line 154. > # Postmaster PID for node "primary" is 198592 > > > That BackgroundPsql.pm line is this in wait_connect() > > $self->{run}->pump() > until $self->{stdout} =~ /$banner/ || $self->{timeout}->is_expired; > > By trial and error I found that it fails on this line 70: > > push(@sessions, background_psql_as_user('regress_superuser')); > > but I have no idea idea why. There are multiple similar calls a couple > lines earlier, and those work fine. And various other TAP tests with > background_sql() work fine too. > > So what's so special about this particular line? Weird. Valgrind makes everything slow; is it a timeout? Any other clues in the logs? -- Heikki Linnakangas Neon (https://neon.tech)
On 09/12/2024 14:47, Tomas Vondra wrote: > On 12/9/24 13:30, Heikki Linnakangas wrote: >> On 09/12/2024 01:12, Tomas Vondra wrote: >>> On 11/14/24 15:13, Heikki Linnakangas wrote: >>>> On 09/10/2024 23:40, Heikki Linnakangas wrote: >>>>> I pushed the first three patches, with the new test and one of the >>>>> small >>>>> refactoring patches. Thanks for all the comments so far! Here is a new >>>>> version of the remaining patches. >>>>> >>> Hi, the TAP test 001_connection_limits.pl introduced by 6a1d0d470e84 >>> seems to have problems with valgrind :-( I reliably get this failure: >> >> How exactly do you run the test with valgrind? What platform? >> > > It failed for me on both amd64 (Fedora 41) and rpi5 32/64-bit (Debian). > >> It works for me, with this: >> >> (cd build && ninja && rm -rf tmp_install && meson test --suite setup && >> valgrind --leak-check=no --gen-suppressions=all --suppressions=/home/ >> heikki/git-sandbox/postgresql/src/tools/valgrind.supp --time-stamp=yes >> --error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END --log-file=$HOME/ >> pg-valgrind/%p.log --trace-children=yes meson test --suite postmaster ) >> > > I have a patch that tweaks pg_ctl/pg_regress to execute valgrind, so I > just do > > ./configure --enable-debug --prefix=/home/user/builds/master > --enable-depend --enable-cassert --enable-tap-tests CPPFLAGS="-O0 -ggdb3 > -DUSE_VALGRIND" > > and then the usual "make check" or whatever. > > The patch has a hardcoded path to the .supp file, and places the > valgrind log into /tmp. It has worked for me fine up until that commit, > and it still seems to be working in every other test directory. Ok, I was able to reproduce this with that setup. Unsurprisingly, it's a timing issue. It can be reproduced without valgrind by adding this delay: diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 289059435a9..1eb6bad72ca 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -583,6 +583,7 @@ errfinish(const char *filename, int lineno, const char *funcname) * FATAL termination. The postmaster may or may not consider this * worthy of panic, depending on which subprocess returns it. */ + sleep(1); proc_exit(1); } The test opens a connection that is expected to fail with the "remaining connection slots are reserved for roles with the SUPERUSER attribute" error. Right after that, it opens a new connection as superuser, and expects it to succeed. But if the previous backend hasn't exited yet, the new connection fails with "too many clients already". Not sure how to fix this. A small sleep in the test would work, but in principle there's no delay that's guaranteed to be enough. A more robust solution would be to run a "select count(*) from pg_stat_activity" and wait until the number of connections are what's expected. I'll try that and see how complicated that gets.. -- Heikki Linnakangas Neon (https://neon.tech)
On 09/12/2024 22:55, Heikki Linnakangas wrote: > Not sure how to fix this. A small sleep in the test would work, but in > principle there's no delay that's guaranteed to be enough. A more robust > solution would be to run a "select count(*) from pg_stat_activity" and > wait until the number of connections are what's expected. I'll try that > and see how complicated that gets.. Checking pg_stat_activity doesn't help, because the backend doesn't register itself in pg_stat_activity until later. A connection that's rejected due to connection limits never shows up in pg_stat_activity. Some options: 0. Do nothing 1. Add a small sleep to the test 2. Move the pgstat_bestart() call earlier in the startup sequence, so that a backend shows up in pg_stat_activity before it acquires a PGPROC entry, and stays visible until after it has released its PGPROC entry. This would give more visibility to backends that are starting up. 3. Rearrange the FATAL error handling so that the process removes itself from PGPROC before sending the error to the client. That would be kind of nice anyway. Currently, if sending the rejection error message to the client blocks, you are holding up a PGPROC slot until the message is sent. The error message packet is short, so it's highly unlikely to block, but still. Option 3 seems kind of nice in principle, but looking at the code, it's a bit awkward to implement. Easiest way to implement it would be to modify send_message_to_frontend() to not call pq_flush() on FATAL errors, and flush the data in socket_close() instead. Not a lot of code, but it's a pretty ugly special case. Option 2 seems nice too, but seems like a lot of work. -- Heikki Linnakangas Neon (https://neon.tech)
On 12/10/24 11:00, Heikki Linnakangas wrote: > On 09/12/2024 22:55, Heikki Linnakangas wrote: >> Not sure how to fix this. A small sleep in the test would work, but in >> principle there's no delay that's guaranteed to be enough. A more >> robust solution would be to run a "select count(*) from >> pg_stat_activity" and wait until the number of connections are what's >> expected. I'll try that and see how complicated that gets.. > > Checking pg_stat_activity doesn't help, because the backend doesn't > register itself in pg_stat_activity until later. A connection that's > rejected due to connection limits never shows up in pg_stat_activity. > > Some options: > > 0. Do nothing > > 1. Add a small sleep to the test > I'd just add a short sleep. Yeah, sleeps are not great, but everything else seems like a lot of effort just to make this one test pass under valgrind, and I don't think it's worth it. Can we make the sleep conditional on valgrind, so that regular builds are not affected? I guess regular builds could fail too, but I don't think we've seen such failures until now. regards -- Tomas Vondra