Re: Refactoring postmaster's code to cleanup after child exit - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Refactoring postmaster's code to cleanup after child exit
Date
Msg-id CA+hUKGKWqSFoJXzRC9D0owFj1Xby0qvWr-k6p8z9gQzT7+nr7w@mail.gmail.com
Whole thread Raw
In response to Re: Refactoring postmaster's code to cleanup after child exit  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Refactoring postmaster's code to cleanup after child exit
List pgsql-hackers
On Fri, Aug 2, 2024 at 11:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> * v3-0001-Make-BackgroundWorkerList-doubly-linked.patch

LGTM.

> [v3-0002-Refactor-code-to-handle-death-of-a-backend-or-bgw.patch]

    Currently, when a child process exits, the postmaster first scans
    through BackgroundWorkerList, to see if it the child process was a
    background worker. If not found, then it scans through BackendList to
    see if it was a regular backend. That leads to some duplication
    between the bgworker and regular backend cleanup code, as both have an
    entry in the BackendList that needs to be cleaned up in the same way.
    Refactor that so that we scan just the BackendList to find the child
    process, and if it was a background worker, do the additional
    bgworker-specific cleanup in addition to the normal Backend cleanup.

Makes sense.

    On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's
    now logged with that exit code, instead of 0. Also, if a bgworker
    exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is
    restarted. Previously it was treated as a normal exit.

Interesting.  So when that error was first specially handled in this thread:


https://www.postgresql.org/message-id/flat/AANLkTimCTkNKKrHCd3Ot6kAsrSS7SeDpOTcaLsEP7i%2BM%40mail.gmail.com#41f60947571b75377f04af67ba6baf40

... it went from being considered a crash, to being considered like
exit(0).  It's true that shared memory can't be corrupted by a process
that never enters main(), but it's better not to hide the true reason
for the failure (if it is still possible -- I don't find many
references to that phenomenon in recent times).  Clobbering exitstatus
with 0 doesn't seem right at all, now that we have background workers
whose restart behaviour is affected by that.

    If a child process is not found in the BackendList, the log message
    now calls it "untracked child process" rather than "server process".
    Arguably that should be a PANIC, because we do track all the child
    processes in the list, so failing to find a child process is highly
    unexpected. But if we want to change that, let's discuss and do that
    as a separate commit.

Yeah, it would be highly unexpected if waitpid() told you about some
random other process (or we screwed up the bookkeeping and didn't
recognise it).  So at least having a different message seems good.

> * v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch

+1

> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch

    Much of the code in process_pm_child_exit() to launch replacement
    processes when one exits or when progressing to next postmaster state
    was unnecessary, because the ServerLoop will launch any missing
    background processes anyway. Remove the redundant code and let
    ServerLoop handle it.

+1, makes sense.

    In ServerLoop, move the code to launch all the processes to a new
    subroutine, to group it all together.

+1, makes sense.

More soon...



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Enable data checksums by default
Next
From: Shlok Kyal
Date:
Subject: Re: long-standing data loss bug in initial sync of logical replication