Thread: Refactoring postmaster's code to cleanup after child exit
Reading through postmaster code, I spotted some refactoring opportunities to make it slightly more readable. Currently, when a child process exits, the postmaster first scans through BackgroundWorkerList to see if it was a bgworker process. If not found, it scans through the BackendList to see if it was a backend process (which it really should be then). That feels a bit silly, because every running background worker process also has an entry in BackendList. There's a lot of duplication between CleanupBackgroundWorker and CleanupBackend. Before commit 8a02b3d732, we used to created Backend entries only for background worker processes that connected to a database, not for other background worker processes. I think that's why we have the code structure we have. But now that we have a Backend entry for all bgworker processes, it's more natural to have single function to deal with both regular backends and bgworkers. So I came up with the attached patches. This doesn't make any meaningful user-visible changes, except for some incidental changes in log messages (see commit message for details). -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
On 06/07/2024 22:01, Heikki Linnakangas wrote: > Reading through postmaster code, I spotted some refactoring > opportunities to make it slightly more readable. > > Currently, when a child process exits, the postmaster first scans > through BackgroundWorkerList to see if it was a bgworker process. If not > found, it scans through the BackendList to see if it was a backend > process (which it really should be then). That feels a bit silly, > because every running background worker process also has an entry in > BackendList. There's a lot of duplication between > CleanupBackgroundWorker and CleanupBackend. > > Before commit 8a02b3d732, we used to created Backend entries only for > background worker processes that connected to a database, not for other > background worker processes. I think that's why we have the code > structure we have. But now that we have a Backend entry for all bgworker > processes, it's more natural to have single function to deal with both > regular backends and bgworkers. > > So I came up with the attached patches. This doesn't make any meaningful > user-visible changes, except for some incidental changes in log messages > (see commit message for details). New patch version attached. Fixed conflicts with recent commits, no other changes. -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
I committed the first two trivial patches, and have continued to work on postmaster.c, and how it manages all the child processes. This is a lot of patches. They're built on top of each other, because that's the order I developed them in, but they probably could be applied in different order too. Please help me by reviewing these, before the stack grows even larger :-). Even partial reviews would be very helpful. I suggest to start reading them in order, and when you get tired, just send any comments you have up to that point. * v3-0001-Make-BackgroundWorkerList-doubly-linked.patch This is the same refactoring patch I started this thread with. * v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch Little refactoring of how postmaster launches the background processes. * v3-0005-Add-test-for-connection-limits.patch * v3-0006-Add-test-for-dead-end-backends.patch A few new TAP tests for dead-end backends and enforcing connection limits. We didn't have much coverage for these before. * v3-0007-Use-an-shmem_exit-callback-to-remove-backend-from.patch * v3-0008-Introduce-a-separate-BackendType-for-dead-end-chi.patch Some preliminary refactoring towards patch v3-0010-Assign-a-child-slot-to-every-postmaster-child-pro.patch * v3-0009-Kill-dead-end-children-when-there-s-nothing-else-.patch I noticed that we never send SIGTERM or SIGQUIT to dead-end backends, which seems silly. If the server is shutting down, dead-end backends might prevent the shutdown from completing. Dead-end backends will expire after authentication_timoeut (default 60s), so it won't last for too long, but still seems like we should kill dead-end backends if they're the only children preventing shutdown from completing. * 3-0010-Assign-a-child-slot-to-every-postmaster-child-pro.patch This is what I consider the main patch in this series. Currently, only regular backens, bgworkers and autovacuum workers have a PMChildFlags slot, which is used to detect when a postmaster child exits in an unclean way (in addition to the exit code). This patch assigns a child slot for all processes, except for dead-end backends. That includes all the aux processes. While we're at it, I created separate pools of child slots for different kinds of backends, which fixes the issue that opening a lot of client connections can exhaust all the slots, so that background workers or autovacuum workers cannot start either [1]. [1] https://www.postgresql.org/message-id/55d2f50c-0b81-4b33-b202-cd2a406d69a3%40iki.fi * v3-0011-Pass-MyPMChildSlot-as-an-explicit-argument-to-chi.patch One more little refactoring, to pass MyPMChildSlot to the child process differently. Where is all this leading? I'm not sure exactly, but having a postmaster child slot for every postmaster child seems highly useful. We could move the ProcSignal machinery to use those slot numbers for the indexes to the ProcSignal array, instead of ProcSignal, for example. That would allow all processes to participate in the signalling, even before they have a PGPROC entry. (Or with Thomas's interrupts refactoring, the interrupts array). With the multithreading work, PMChild struct could store a thread id, or whatever is needed for threads to communicate with each other. In any case, seems like it will come handy. -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
- v3-0001-Make-BackgroundWorkerList-doubly-linked.patch
- v3-0002-Refactor-code-to-handle-death-of-a-backend-or-bgw.patch
- v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch
- v3-0004-Consolidate-postmaster-code-to-launch-background-.patch
- v3-0005-Add-test-for-connection-limits.patch
- v3-0006-Add-test-for-dead-end-backends.patch
- v3-0007-Use-an-shmem_exit-callback-to-remove-backend-from.patch
- v3-0008-Introduce-a-separate-BackendType-for-dead-end-chi.patch
- v3-0009-Kill-dead-end-children-when-there-s-nothing-else-.patch
- v3-0010-Assign-a-child-slot-to-every-postmaster-child-pro.patch
- v3-0011-Pass-MyPMChildSlot-as-an-explicit-argument-to-chi.patch
On Fri, Aug 2, 2024 at 11:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > * v3-0001-Make-BackgroundWorkerList-doubly-linked.patch LGTM. > [v3-0002-Refactor-code-to-handle-death-of-a-backend-or-bgw.patch] Currently, when a child process exits, the postmaster first scans through BackgroundWorkerList, to see if it the child process was a background worker. If not found, then it scans through BackendList to see if it was a regular backend. That leads to some duplication between the bgworker and regular backend cleanup code, as both have an entry in the BackendList that needs to be cleaned up in the same way. Refactor that so that we scan just the BackendList to find the child process, and if it was a background worker, do the additional bgworker-specific cleanup in addition to the normal Backend cleanup. Makes sense. On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's now logged with that exit code, instead of 0. Also, if a bgworker exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is restarted. Previously it was treated as a normal exit. Interesting. So when that error was first specially handled in this thread: https://www.postgresql.org/message-id/flat/AANLkTimCTkNKKrHCd3Ot6kAsrSS7SeDpOTcaLsEP7i%2BM%40mail.gmail.com#41f60947571b75377f04af67ba6baf40 ... it went from being considered a crash, to being considered like exit(0). It's true that shared memory can't be corrupted by a process that never enters main(), but it's better not to hide the true reason for the failure (if it is still possible -- I don't find many references to that phenomenon in recent times). Clobbering exitstatus with 0 doesn't seem right at all, now that we have background workers whose restart behaviour is affected by that. If a child process is not found in the BackendList, the log message now calls it "untracked child process" rather than "server process". Arguably that should be a PANIC, because we do track all the child processes in the list, so failing to find a child process is highly unexpected. But if we want to change that, let's discuss and do that as a separate commit. Yeah, it would be highly unexpected if waitpid() told you about some random other process (or we screwed up the bookkeeping and didn't recognise it). So at least having a different message seems good. > * v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch +1 > * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch Much of the code in process_pm_child_exit() to launch replacement processes when one exits or when progressing to next postmaster state was unnecessary, because the ServerLoop will launch any missing background processes anyway. Remove the redundant code and let ServerLoop handle it. +1, makes sense. In ServerLoop, move the code to launch all the processes to a new subroutine, to group it all together. +1, makes sense. More soon...
On 08/08/2024 13:47, Thomas Munro wrote: > On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's > now logged with that exit code, instead of 0. Also, if a bgworker > exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is > restarted. Previously it was treated as a normal exit. > > Interesting. So when that error was first specially handled in this thread: > > https://www.postgresql.org/message-id/flat/AANLkTimCTkNKKrHCd3Ot6kAsrSS7SeDpOTcaLsEP7i%2BM%40mail.gmail.com#41f60947571b75377f04af67ba6baf40 > > ... it went from being considered a crash, to being considered like > exit(0). It's true that shared memory can't be corrupted by a process > that never enters main(), but it's better not to hide the true reason > for the failure (if it is still possible -- I don't find many > references to that phenomenon in recent times). Clobbering exitstatus > with 0 doesn't seem right at all, now that we have background workers > whose restart behaviour is affected by that. I adjusted this ERROR_WAIT_NO_CHILDREN a little more, to avoid logging the death of the child twice in some cases. >> * v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch > > +1 Committed the patches up to and including this one, with tiny comment changes. >> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch > > Much of the code in process_pm_child_exit() to launch replacement > processes when one exits or when progressing to next postmaster state > was unnecessary, because the ServerLoop will launch any missing > background processes anyway. Remove the redundant code and let > ServerLoop handle it. I'm going to work a little more on the comments on this one before committing; I had just moved all the "If we have lost the XXX, try to start a new one" comments as is, but they look pretty repetitive now. Thanks for the review! -- Heikki Linnakangas Neon (https://neon.tech)
On 10/08/2024 00:13, Heikki Linnakangas wrote: > On 08/08/2024 13:47, Thomas Munro wrote: >>> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch >> >> Much of the code in process_pm_child_exit() to launch replacement >> processes when one exits or when progressing to next postmaster >> state >> was unnecessary, because the ServerLoop will launch any missing >> background processes anyway. Remove the redundant code and let >> ServerLoop handle it. > > I'm going to work a little more on the comments on this one before > committing; I had just moved all the "If we have lost the XXX, try to > start a new one" comments as is, but they look pretty repetitive now. Pushed this now, after adjusting the comments a bit. Thanks again for the review! Here are the remaining patches, rebased. > commit a1c43d65907d20a999b203e465db1277ec842a0a > Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Thu Aug 1 17:24:12 2024 +0300 > > Introduce a separate BackendType for dead-end children > > And replace postmaster.c's own "backend type" codes with BackendType > > XXX: While working on this, many times I accidentally did something > like "foo |= B_SOMETHING" instead of "foo |= 1 << B_SOMETHING", when > constructing arguments to SignalSomeChildren or CountChildren, and > things broke in very subtle ways taking a long time to debug. The old > constants that were already bitmasks avoided that. Maybe we need some > macro magic or something to make this less error-prone. While rebasing this today, I spotted another instance of that mistake mentioned in the XXX comment above. I called "CountChildren(B_BACKEND)" instead of "CountChildren(1 << B_BACKEND)". Some ideas on how to make that less error-prone: 1. Add a separate typedef for the bitmasks, and macros/functions to work with it. Something like: typedef struct { uint32 mask; } BackendTypeMask; static const BackendTypeMask BTMASK_ALL = { 0xffffffff }; static const BackendTypeMask BTMASK_NONE = { 0 }; static inline BackendTypeMask BTMASK_ADD(BackendTypeMask mask, BackendType t) { mask.mask |= 1 << t; return mask; } static inline BackendTypeMask BTMASK_DEL(BackendTypeMask mask, BackendType t) { mask.mask &= ~(1 << t); return mask; } Now the compiler will complain if you try to pass a BackendType for the mask. We could do this just for BackendType, or we could put this in src/include/lib/ with a more generic name, like "bitmask_u32". 2. Another idea is to redefine the BackendType values to be separate bits, like the current BACKEND_TYPE_* values in postmaster.c: typedef enum BackendType { B_INVALID = 0, /* Backends and other backend-like processes */ B_BACKEND = 1 << 1, B_DEAD_END_BACKEND = 1 << 2, B_AUTOVAC_LAUNCHER = 1 << 3, B_AUTOVAC_WORKER = 1 << 4, ... } BackendType; Then you can use | and & on BackendTypes directly. It makes it less clear which function arguments are a BackendType and which are a bitmask, however. Thoughts, other ideas? -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
- v4-0001-Add-test-for-connection-limits.patch
- v4-0002-Add-test-for-dead-end-backends.patch
- v4-0003-Use-an-shmem_exit-callback-to-remove-backend-from.patch
- v4-0004-Introduce-a-separate-BackendType-for-dead-end-chi.patch
- v4-0005-Kill-dead-end-children-when-there-s-nothing-else-.patch
- v4-0006-Assign-a-child-slot-to-every-postmaster-child-pro.patch
- v4-0007-Pass-MyPMChildSlot-as-an-explicit-argument-to-chi.patch