Thread: Refactoring postmaster's code to cleanup after child exit

Refactoring postmaster's code to cleanup after child exit

From

Heikki Linnakangas

Date:

06 July 2024, 19:01:44

Reading through postmaster code, I spotted some refactoring 
opportunities to make it slightly more readable.

Currently, when a child process exits, the postmaster first scans 
through BackgroundWorkerList to see if it was a bgworker process. If not 
found, it scans through the BackendList to see if it was a backend 
process (which it really should be then). That feels a bit silly, 
because every running background worker process also has an entry in 
BackendList. There's a lot of duplication between 
CleanupBackgroundWorker and CleanupBackend.

Before commit 8a02b3d732, we used to create Backend entries only for 
background worker processes that connected to a database, not for other 
background worker processes. I think that's why we have the code 
structure we have. But now that we have a Backend entry for all bgworker 
processes, it's more natural to have a single function to deal with both 
regular backends and bgworkers.
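
A minimal sketch of the single-lookup idea, assuming simplified stand-ins for the real structs. The names `track_child`, `cleanup_backend`, `BgWorkerInfo` and the `last_exitstatus` field are hypothetical and not taken from postmaster.c; the point is only that one scan of the list finds the entry, and the bgworker-specific step is an extra branch rather than a separate scan of a second list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-ins for the postmaster's structs. */
typedef struct BgWorkerInfo
{
    pid_t       pid;                /* 0 when the worker is not running */
    int         last_exitstatus;    /* recorded when the worker exits */
} BgWorkerInfo;

typedef struct Backend
{
    pid_t       pid;
    BgWorkerInfo *bgworker;         /* NULL for a regular backend */
    struct Backend *next;
} Backend;

static Backend *backend_list = NULL;

/* Every child, bgworker or not, gets an entry at the head of the list. */
static void
track_child(Backend *bp)
{
    assert(bp != NULL);
    bp->next = backend_list;
    backend_list = bp;
}

/*
 * One scan handles both kinds of child. Returns false if the PID is not
 * in the list at all (the "untracked child process" case).
 */
static bool
cleanup_backend(pid_t pid, int exitstatus)
{
    for (Backend **link = &backend_list; *link != NULL; link = &(*link)->next)
    {
        Backend    *bp = *link;

        if (bp->pid != pid)
            continue;

        /* Additional bookkeeping that only applies to bgworkers. */
        if (bp->bgworker != NULL)
        {
            bp->bgworker->pid = 0;
            bp->bgworker->last_exitstatus = exitstatus;
        }

        /* Cleanup shared by every kind of entry: unlink it from the list. */
        *link = bp->next;
        bp->next = NULL;
        return true;
    }
    return false;
}
```

A caller would invoke `cleanup_backend(pid, exitstatus)` once per reaped child and log the "untracked" case when it returns false.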

So I came up with the attached patches. This doesn't make any meaningful 
user-visible changes, except for some incidental changes in log messages 
(see commit message for details).

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Refactoring postmaster's code to cleanup after child exit

From

Heikki Linnakangas

Date:

29 July 2024, 20:16:33

On 06/07/2024 22:01, Heikki Linnakangas wrote:
> Reading through postmaster code, I spotted some refactoring 
> opportunities to make it slightly more readable.
> 
> Currently, when a child process exits, the postmaster first scans 
> through BackgroundWorkerList to see if it was a bgworker process. If not 
> found, it scans through the BackendList to see if it was a backend 
> process (which it really should be then). That feels a bit silly, 
> because every running background worker process also has an entry in 
> BackendList. There's a lot of duplication between 
> CleanupBackgroundWorker and CleanupBackend.
> 
> Before commit 8a02b3d732, we used to create Backend entries only for 
> background worker processes that connected to a database, not for other 
> background worker processes. I think that's why we have the code 
> structure we have. But now that we have a Backend entry for all bgworker 
> processes, it's more natural to have a single function to deal with both 
> regular backends and bgworkers.
> 
> So I came up with the attached patches. This doesn't make any meaningful 
> user-visible changes, except for some incidental changes in log messages 
> (see commit message for details).

New patch version attached. Fixed conflicts with recent commits, no 
other changes.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Refactoring postmaster's code to cleanup after child exit

From

Heikki Linnakangas

Date:

01 August 2024, 23:57:18

I committed the first two trivial patches, and have continued to work on 
postmaster.c, and how it manages all the child processes.

This is a lot of patches. They're built on top of each other, because 
that's the order I developed them in, but they probably could be applied 
in different order too. Please help me by reviewing these, before the 
stack grows even larger :-). Even partial reviews would be very helpful. 
I suggest to start reading them in order, and when you get tired, just 
send any comments you have up to that point.


* v3-0001-Make-BackgroundWorkerList-doubly-linked.patch

This is the same refactoring patch I started this thread with.

* v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch
* v3-0004-Consolidate-postmaster-code-to-launch-background-.patch

Little refactoring of how postmaster launches the background processes.

* v3-0005-Add-test-for-connection-limits.patch
* v3-0006-Add-test-for-dead-end-backends.patch

A few new TAP tests for dead-end backends and enforcing connection 
limits. We didn't have much coverage for these before.

* v3-0007-Use-an-shmem_exit-callback-to-remove-backend-from.patch
* v3-0008-Introduce-a-separate-BackendType-for-dead-end-chi.patch

Some preliminary refactoring towards patch 
v3-0010-Assign-a-child-slot-to-every-postmaster-child-pro.patch

* v3-0009-Kill-dead-end-children-when-there-s-nothing-else-.patch

I noticed that we never send SIGTERM or SIGQUIT to dead-end backends, 
which seems silly. If the server is shutting down, dead-end backends 
might prevent the shutdown from completing. Dead-end backends will 
expire after authentication_timeout (default 60s), so it won't last for 
too long, but still seems like we should kill dead-end backends if 
they're the only children preventing shutdown from completing.

* v3-0010-Assign-a-child-slot-to-every-postmaster-child-pro.patch

This is what I consider the main patch in this series. Currently, only 
regular backends, bgworkers and autovacuum workers have a PMChildFlags 
slot, which is used to detect when a postmaster child exits in an 
unclean way (in addition to the exit code). This patch assigns a child 
slot for all processes, except for dead-end backends. That includes all 
the aux processes.

While we're at it, I created separate pools of child slots for different 
kinds of backends, which fixes the issue that opening a lot of client 
connections can exhaust all the slots, so that background workers or 
autovacuum workers cannot start either [1].

[1] 
https://www.postgresql.org/message-id/55d2f50c-0b81-4b33-b202-cd2a406d69a3%40iki.fi
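
To illustrate the separate-pools idea, here is a rough sketch under stated assumptions. The kinds, pool sizes and function names (`ChildKind`, `assign_child_slot`, `release_child_slot`) are made up for the example and do not reflect the patch's actual data structures. What it shows is that exhausting one pool leaves the others untouched.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical child kinds and pool sizes, for illustration only. */
typedef enum ChildKind
{
    KIND_BACKEND,
    KIND_AUTOVAC_WORKER,
    KIND_BGWORKER,
    NUM_KINDS
} ChildKind;

#define MAX_SLOTS_PER_KIND 8

typedef struct SlotPool
{
    int         size;                           /* slots reserved for this kind */
    bool        in_use[MAX_SLOTS_PER_KIND];
} SlotPool;

static SlotPool pools[NUM_KINDS] = {
    [KIND_BACKEND] = {.size = 4},
    [KIND_AUTOVAC_WORKER] = {.size = 2},
    [KIND_BGWORKER] = {.size = 2},
};

/* Returns a slot index within the kind's own pool, or -1 if that pool is full. */
static int
assign_child_slot(ChildKind kind)
{
    SlotPool   *pool = &pools[kind];

    assert(pool->size <= MAX_SLOTS_PER_KIND);
    for (int i = 0; i < pool->size; i++)
    {
        if (!pool->in_use[i])
        {
            pool->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Gives a slot back to its pool so the next child of that kind can use it. */
static void
release_child_slot(ChildKind kind, int slot)
{
    assert(slot >= 0 && slot < pools[kind].size);
    assert(pools[kind].in_use[slot]);
    pools[kind].in_use[slot] = false;
}
```

With this layout, a flood of client connections can only make `assign_child_slot(KIND_BACKEND)` fail; requests for the autovacuum or bgworker pools still succeed.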

* v3-0011-Pass-MyPMChildSlot-as-an-explicit-argument-to-chi.patch

One more little refactoring, to pass MyPMChildSlot to the child process 
differently.


Where is all this leading? I'm not sure exactly, but having a postmaster 
child slot for every postmaster child seems highly useful. We could move 
the ProcSignal machinery to use those slot numbers for the indexes to 
the ProcSignal array, instead of ProcNumber, for example. That would 
allow all processes to participate in the signalling, even before they 
have a PGPROC entry. (Or with Thomas's interrupts refactoring, the 
interrupts array). With the multithreading work, PMChild struct could 
store a thread id, or whatever is needed for threads to communicate with 
each other. In any case, seems like it will come in handy.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

Re: Refactoring postmaster's code to cleanup after child exit

From

Thomas Munro

Date:

08 August 2024, 10:47:42

On Fri, Aug 2, 2024 at 11:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> * v3-0001-Make-BackgroundWorkerList-doubly-linked.patch

LGTM.

> [v3-0002-Refactor-code-to-handle-death-of-a-backend-or-bgw.patch]

Currently, when a child process exits, the postmaster first scans
through BackgroundWorkerList, to see if the child process was a
background worker. If not found, then it scans through BackendList to
see if it was a regular backend. That leads to some duplication
between the bgworker and regular backend cleanup code, as both have an
entry in the BackendList that needs to be cleaned up in the same way.
Refactor that so that we scan just the BackendList to find the child
process, and if it was a background worker, do the additional
bgworker-specific cleanup in addition to the normal Backend cleanup.

Makes sense.

On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's
now logged with that exit code, instead of 0. Also, if a bgworker
exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is
restarted. Previously it was treated as a normal exit.

Interesting. So when that error was first specially handled in this thread:

https://www.postgresql.org/message-id/flat/AANLkTimCTkNKKrHCd3Ot6kAsrSS7SeDpOTcaLsEP7i%2BM%40mail.gmail.com#41f60947571b75377f04af67ba6baf40

... it went from being considered a crash, to being considered like
exit(0). It's true that shared memory can't be corrupted by a process
that never enters main(), but it's better not to hide the true reason
for the failure (if it is still possible -- I don't find many
references to that phenomenon in recent times). Clobbering exitstatus
with 0 doesn't seem right at all, now that we have background workers
whose restart behaviour is affected by that.

If a child process is not found in the BackendList, the log message
now calls it "untracked child process" rather than "server process".
Arguably that should be a PANIC, because we do track all the child
processes in the list, so failing to find a child process is highly
unexpected. But if we want to change that, let's discuss and do that
as a separate commit.

Yeah, it would be highly unexpected if waitpid() told you about some
random other process (or we screwed up the bookkeeping and didn't
recognise it). So at least having a different message seems good.

> * v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch

> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch

Much of the code in process_pm_child_exit() to launch replacement
processes when one exits or when progressing to next postmaster state
was unnecessary, because the ServerLoop will launch any missing
background processes anyway. Remove the redundant code and let
ServerLoop handle it.

+1, makes sense.

In ServerLoop, move the code to launch all the processes to a new
subroutine, to group it all together.

+1, makes sense.
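
As a rough sketch of why the relaunch code in the exit handler is redundant, assume the main loop calls one idempotent "launch whatever is missing" routine on every iteration. Everything here is hypothetical and simplified: `AuxProcess`, `launch_missing_processes`, `note_child_exit` and the `fake_launch` stand-in (which just hands out increasing PIDs instead of forking) are not names from postmaster.c.

```c
#include <assert.h>
#include <stddef.h>

typedef struct AuxProcess
{
    const char *name;
    int         pid;        /* 0 means "not running" */
} AuxProcess;

/* Launcher callback: returns the new PID, or 0 if the launch failed. */
typedef int (*launch_fn) (const char *name);

/* Stand-in for a real fork/exec; hands out increasing fake PIDs. */
static int  next_fake_pid = 1000;

static int
fake_launch(const char *name)
{
    (void) name;
    return next_fake_pid++;
}

/*
 * Called from the main loop on every iteration. Starts only the processes
 * that are not running, so calling it repeatedly is harmless.
 */
static int
launch_missing_processes(AuxProcess *procs, size_t nprocs, launch_fn launch)
{
    int         started = 0;

    assert(launch != NULL);
    for (size_t i = 0; i < nprocs; i++)
    {
        if (procs[i].pid != 0)
            continue;           /* already running */
        procs[i].pid = launch(procs[i].name);
        if (procs[i].pid != 0)
            started++;
    }
    return started;
}

/* The exit handler only records the death; it does not relaunch anything. */
static void
note_child_exit(AuxProcess *procs, size_t nprocs, int pid)
{
    for (size_t i = 0; i < nprocs; i++)
    {
        if (procs[i].pid == pid)
            procs[i].pid = 0;
    }
}
```

Because the loop re-checks every process each time around, the exit path can stay small and the launch conditions live in one place.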

More soon...

Re: Refactoring postmaster's code to cleanup after child exit

From

Heikki Linnakangas

Date:

09 August 2024, 21:13:37

On 08/08/2024 13:47, Thomas Munro wrote:
>      On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's
>      now logged with that exit code, instead of 0. Also, if a bgworker
>      exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is
>      restarted. Previously it was treated as a normal exit.
> 
> Interesting.  So when that error was first specially handled in this thread:
> 
> https://www.postgresql.org/message-id/flat/AANLkTimCTkNKKrHCd3Ot6kAsrSS7SeDpOTcaLsEP7i%2BM%40mail.gmail.com#41f60947571b75377f04af67ba6baf40
> 
> ... it went from being considered a crash, to being considered like
> exit(0).  It's true that shared memory can't be corrupted by a process
> that never enters main(), but it's better not to hide the true reason
> for the failure (if it is still possible -- I don't find many
> references to that phenomenon in recent times).  Clobbering exitstatus
> with 0 doesn't seem right at all, now that we have background workers
> whose restart behaviour is affected by that.

I adjusted this ERROR_WAIT_NO_CHILDREN a little more, to avoid logging 
the death of the child twice in some cases.

>> * v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch
> 
> +1

Committed the patches up to and including this one, with tiny comment 
changes.

>> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch
> 
>      Much of the code in process_pm_child_exit() to launch replacement
>      processes when one exits or when progressing to next postmaster state
>      was unnecessary, because the ServerLoop will launch any missing
>      background processes anyway. Remove the redundant code and let
>      ServerLoop handle it.

I'm going to work a little more on the comments on this one before 
committing; I had just moved all the "If we have lost the XXX, try to 
start a new one" comments as is, but they look pretty repetitive now.

Thanks for the review!

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Re: Refactoring postmaster's code to cleanup after child exit

From

Heikki Linnakangas

Date:

12 August 2024, 09:55:00

On 10/08/2024 00:13, Heikki Linnakangas wrote:
> On 08/08/2024 13:47, Thomas Munro wrote:
>>> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch
>>
>>      Much of the code in process_pm_child_exit() to launch replacement
>>      processes when one exits or when progressing to next postmaster 
>> state
>>      was unnecessary, because the ServerLoop will launch any missing
>>      background processes anyway. Remove the redundant code and let
>>      ServerLoop handle it.
> 
> I'm going to work a little more on the comments on this one before 
> committing; I had just moved all the "If we have lost the XXX, try to 
> start a new one" comments as is, but they look pretty repetitive now.

Pushed this now, after adjusting the comments a bit. Thanks again for 
the review!

Here are the remaining patches, rebased.

> commit a1c43d65907d20a999b203e465db1277ec842a0a
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date:   Thu Aug 1 17:24:12 2024 +0300
> 
>     Introduce a separate BackendType for dead-end children
>     
>     And replace postmaster.c's own "backend type" codes with BackendType
>     
>     XXX: While working on this, many times I accidentally did something
>     like "foo |= B_SOMETHING" instead of "foo |= 1 << B_SOMETHING", when
>     constructing arguments to SignalSomeChildren or CountChildren, and
>     things broke in very subtle ways taking a long time to debug. The old
>     constants that were already bitmasks avoided that. Maybe we need some
>     macro magic or something to make this less error-prone.

While rebasing this today, I spotted another instance of that mistake 
mentioned in the XXX comment above. I called "CountChildren(B_BACKEND)" 
instead of "CountChildren(1 << B_BACKEND)". Some ideas on how to make 
that less error-prone:

1. Add a separate typedef for the bitmasks, and macros/functions to work 
with it. Something like:

typedef struct {
    uint32        mask;
} BackendTypeMask;

static const BackendTypeMask BTMASK_ALL = { 0xffffffff };
static const BackendTypeMask BTMASK_NONE = { 0 };

static inline BackendTypeMask
BTMASK_ADD(BackendTypeMask mask, BackendType t)
{
    mask.mask |= 1 << t;
    return mask;
}

static inline BackendTypeMask
BTMASK_DEL(BackendTypeMask mask, BackendType t)
{
    mask.mask &= ~(1 << t);
    return mask;
}

Now the compiler will complain if you try to pass a BackendType for the 
mask. We could do this just for BackendType, or we could put this in 
src/include/lib/ with a more generic name, like "bitmask_u32".
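
For illustration, a self-contained sketch of how a caller might use such a wrapper type. It substitutes `uint32_t` for `uint32` so it compiles outside the PostgreSQL tree, and the enum ordering and the `count_children` helper are assumptions made for the example, not the real `CountChildren()`.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative subset of backend types with ordinary sequential values. */
typedef enum BackendType
{
    B_INVALID = 0,
    B_BACKEND,
    B_DEAD_END_BACKEND,
    B_AUTOVAC_LAUNCHER,
    B_AUTOVAC_WORKER
} BackendType;

/* Wrapping the integer in a struct makes the mask a distinct type. */
typedef struct
{
    uint32_t    mask;
} BackendTypeMask;

static const BackendTypeMask BTMASK_NONE = {0};

static inline BackendTypeMask
BTMASK_ADD(BackendTypeMask mask, BackendType t)
{
    mask.mask |= UINT32_C(1) << t;
    return mask;
}

/*
 * Hypothetical stand-in for CountChildren(): counts the entries whose type
 * is included in the mask.
 */
static int
count_children(const BackendType *children, int nchildren,
               BackendTypeMask targets)
{
    int         count = 0;

    assert(nchildren >= 0);
    for (int i = 0; i < nchildren; i++)
    {
        if (targets.mask & (UINT32_C(1) << children[i]))
            count++;
    }
    return count;
}

/*
 * count_children(children, n, B_BACKEND) would not compile here, because an
 * enum value cannot be converted to the struct type. That is the mistake
 * the wrapper is meant to catch.
 */
```

A call site then reads `count_children(children, n, BTMASK_ADD(BTMASK_NONE, B_BACKEND))`, which is wordier than a bare integer mask but cannot be confused with a single BackendType.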

2. Another idea is to redefine the BackendType values to be separate 
bits, like the current BACKEND_TYPE_* values in postmaster.c:

typedef enum BackendType
{
    B_INVALID = 0,

    /* Backends and other backend-like processes */
    B_BACKEND = 1 << 1,
    B_DEAD_END_BACKEND = 1 << 2,
    B_AUTOVAC_LAUNCHER = 1 << 3,
    B_AUTOVAC_WORKER = 1 << 4,

    ...
} BackendType;

Then you can use | and & on BackendTypes directly. It makes it less 
clear which function arguments are a BackendType and which are a 
bitmask, however.
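
A small sketch of what call sites could look like with bit-valued enum members. The `count_children` helper is again a hypothetical stand-in, and its `targets` parameter is a plain `int`, which shows the downside mentioned above: nothing in the signature distinguishes a single type from a set of types.

```c
#include <assert.h>

/* Each type occupies its own bit, as in the enum sketched above. */
typedef enum BackendType
{
    B_INVALID = 0,
    B_BACKEND = 1 << 1,
    B_DEAD_END_BACKEND = 1 << 2,
    B_AUTOVAC_LAUNCHER = 1 << 3,
    B_AUTOVAC_WORKER = 1 << 4
} BackendType;

/*
 * Hypothetical stand-in for CountChildren(). 'targets' is an OR of
 * BackendType values; a single value is also a valid mask.
 */
static int
count_children(const BackendType *children, int nchildren, int targets)
{
    int         count = 0;

    assert(nchildren >= 0);
    for (int i = 0; i < nchildren; i++)
    {
        if (children[i] & targets)
            count++;
    }
    return count;
}
```

Here both `count_children(children, n, B_BACKEND)` and `count_children(children, n, B_BACKEND | B_AUTOVAC_WORKER)` are correct, so the forgotten-shift mistake cannot happen, at the cost of the compiler no longer telling masks and types apart.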

Thoughts, other ideas?

-- 
Heikki Linnakangas
Neon (https://neon.tech)

On Tue, Mar 04, 2025 at 05:58:42PM -0500, Andres Freund wrote:
> On 2024-12-10 12:00:12 +0200, Heikki Linnakangas wrote:
>> 2. Move the pgstat_bestart() call earlier in the startup sequence, so that a
>> backend shows up in pg_stat_activity before it acquires a PGPROC entry, and
>> stays visible until after it has released its PGPROC entry. This would give
>> more visibility to backends that are starting up.
>
> We don't necessarily *have* a PGPROC entry for that backend when we run out of
> connections, no?

Exactly.  If I got this thread's argument right, you cannot have a
PGPROC entry that could be plugged into pg_stat_activity that early
during the startup process when collecting the startup packet.

> For this test, could we perhaps rely on the log messages postmaster logs when
> child processes exit?
>
> 2025-03-04 17:56:12.528 EST [3509838][not initialized][:0][[unknown]] LOG:  connection received: host=[local]
> 2025-03-04 17:56:12.528 EST [3509838][client backend][:0][[unknown]] FATAL:  sorry, too many clients already
> 2025-03-04 17:56:12.529 EST [3509817][postmaster][:0][] DEBUG:  releasing pm child slot 2
> 2025-03-04 17:56:12.529 EST [3509817][postmaster][:0][] DEBUG:  client backend (PID 3509838) exited with exit code 1
>
> I.e. the test could wait for the 'client backend exited' message using
> ->wait_for_log()?

Matching expected contents in the server logs is a practice I've found
to be rather reliable, with wait_for_log().  Why not adding an
injection point with a WARNING or a LOG generated, then check the
server logs for the code path taken based on the elog() generated with
the point name?
--
Michael

Re: Refactoring postmaster's code to cleanup after child exit

From

Noah Misch

Date:

06 March 2025, 07:49:33

On Tue, Mar 04, 2025 at 05:50:34PM -0500, Andres Freund wrote:
> On 2024-12-09 00:12:32 +0100, Tomas Vondra wrote:
> > [23:48:44.444](1.129s) ok 3 - reserved_connections limit
> > [23:48:44.445](0.001s) ok 4 - reserved_connections limit: matches
> > process ended prematurely at
> > /home/user/work/postgres/src/test/postmaster/../../../src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
> > line 154.
> > # Postmaster PID for node "primary" is 198592
> 
> 
> I just saw this failure on skink in the BF:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-04%2015%3A43%3A23
> 
> [17:05:56.438](0.247s) ok 3 - reserved_connections limit
> [17:05:56.438](0.000s) ok 4 - reserved_connections limit: matches
> process ended prematurely at
> /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm line 160.
> 
> 
> > That BackgroundPsql.pm line is this in wait_connect()
> > 
> >   $self->{run}->pump()
> >     until $self->{stdout} =~ /$banner/ || $self->{timeout}->is_expired;
> 
> A big part of the problem here imo is the exception behaviour that
> IPC::Run::pump() has:
> 
>   If pump() is called after all harnessed activities have completed, a "process
>   ended prematurely" exception to be thrown.  This allows for simple scripting
>   of external applications without having to add lots of error handling code at
>   each step of the script:
> 
> Which is, uh, not very compatible with how we use IPC::Run (here and
> elsewhere).  Just ending the test because a connection failed is pretty awful.

Historically, I think we've avoided this sort of trouble by doing pipe I/O
only on processes where we feel able to predict when the process will exit.
Commit f44b9b6 is one example (simpler case, not involving pump()).  It would
be a nice improvement to do better, since there's always some risk of
unexpected exit.

> This behaviour makes it really hard to debug problems. It'd have been a lot
> easier to understand the problem if we'd seen psql's stderr before the test
> died.
> 
> I guess that mean at the very least we'd need to put an eval {} around the
> ->pump() call., print $self->{stdout}, ->{stderr} and reraise an error?

That sounds right.

Officially, you could call ->pumpable() before ->pump().  It's defined as
'Returns TRUE if calling pump() won't throw an immediate "process ended
prematurely" exception.'  I lack high confidence that it avoids the exception,
because the pump() still calls pumpable()->reap_nb()->waitpid(WNOHANG) and may
decide "process ended prematurely" based on the new finding.  In other words,
I bet there would be a TOCTOU defect in "$h->pump if $h->pumpable".

> Presumably not just in wait_connect(), but also at least in pump_until()?

If the goal is to have it capture maximum data from processes that exit when
we don't expect it (seems good to me), yes.

Re: Refactoring postmaster's code to cleanup after child exit

From

Andres Freund

Date:

06 March 2025, 23:16:20

Hi,

On 2025-03-05 20:49:33 -0800, Noah Misch wrote:
> > This behaviour makes it really hard to debug problems. It'd have been a lot
> > easier to understand the problem if we'd seen psql's stderr before the test
> > died.
> > 
> > I guess that mean at the very least we'd need to put an eval {} around the
> > ->pump() call., print $self->{stdout}, ->{stderr} and reraise an error?
> 
> That sounds right.

In the attached patch I did that for wait_connect().  I did verify that it
works by implementing the wait_connect() fix before fixing
002_connection_limits.pl, which fails if a sleep(1) is added just before the
proc_exit(1) for FATAL.

I didn't tackle pump_until() yet, as it

a) uses pumpable() to check if it's safe to pump() and should kinda sometimes
   maybe report an error, even though the fact that it doesn't display stderr
   (if stdout is waited on) makes it harder to debug.

b) Fixing the error report seems like it'd require an interface change to
   pump_until().

> Officially, you could call ->pumpable() before ->pump().  It's defined as
> 'Returns TRUE if calling pump() won't throw an immediate "process ended
> prematurely" exception.'

It's also documented to be internal only...

I do share your doubts re pumpable():

> I lack high confidence that it avoids the exception,
> because the pump() still calls pumpable()->reap_nb()->waitpid(WNOHANG) and may
> decide "process ended prematurely" based on the new finding.  In other words,
> I bet there would be a TOCTOU defect in "$h->pump if $h->pumpable".

On 2025-03-05 08:23:32 +0900, Michael Paquier wrote:
> > For this test, could we perhaps rely on the log messages postmaster logs when
> > child processes exit?
> > 
> > 2025-03-04 17:56:12.528 EST [3509838][not initialized][:0][[unknown]] LOG:  connection received: host=[local]
> > 2025-03-04 17:56:12.528 EST [3509838][client backend][:0][[unknown]] FATAL:  sorry, too many clients already
> > 2025-03-04 17:56:12.529 EST [3509817][postmaster][:0][] DEBUG:  releasing pm child slot 2
> > 2025-03-04 17:56:12.529 EST [3509817][postmaster][:0][] DEBUG:  client backend (PID 3509838) exited with exit code 1
> > 
> > I.e. the test could wait for the 'client backend exited' message using
> > ->wait_for_log()?
> 
> Matching expected contents in the server logs is a practice I've found
> to be rather reliable, with wait_for_log().

The attached patch implements that approach. It does fix the problem from what
I can tell. It's not great that it requires log_min_messages = DEBUG2, but
that seems ok for this test.

> Why not adding an injection point with a WARNING or a LOG generated, then
> check the server logs for the code path taken based on the elog() generated
> with the point name?

I think the log_min_messages approach is a lot simpler. If we need something
like this more widely we can reconsider injection points...

I also attached a patch to improve connect_fails()/connect_ok() test names a
bit. They weren't symmetric and I felt they were lacking in detail for the
psql return code check.

Another annoying and also funny problem I saw is this failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-06%2009%3A18%3A21
2025-03-06 10:42:02.552 UTC [372451][postmaster][:0] LOG:  1800 s is outside the valid range for parameter "authentication_timeout" (1 s .. 600 s)

I had to increase PG_TEST_TIMEOUT_DEFAULT due to some other test timing out
when run under valgrind (due to having to insert a lot of rows). But then this
test runs into the above issue.

The easiest way seems to be to just limit PG_TEST_TIMEOUT_DEFAULT in this
test.

Greetings,

Andres Freund

Attachment

Re: Refactoring postmaster's code to cleanup after child exit

From

Heikki Linnakangas

Date:

06 March 2025, 23:18:10

On 05/03/2025 01:23, Michael Paquier wrote:
> On Tue, Mar 04, 2025 at 05:58:42PM -0500, Andres Freund wrote:
>> On 2024-12-10 12:00:12 +0200, Heikki Linnakangas wrote:
>>> 2. Move the pgstat_bestart() call earlier in the startup sequence, so that a
>>> backend shows up in pg_stat_activity before it acquires a PGPROC entry, and
>>> stays visible until after it has released its PGPROC entry. This would give
>>> more visibility to backends that are starting up.
>>
>> We don't necessarily *have* a PGPROC entry for that backend when we run out of
>> connections, no?
> 
> Exactly.  If I got this thread's argument right, you cannot have a
> PGPROC entry that could be plugged into pg_stat_activity that early
> during the startup process when collecting the startup packet.

That's true in general; once you start running out of connections, you 
can indeed run out of PGPROC slots too. In this particular case, though, 
there were still PGPROC slots available, reserved for superuser 
connections, so it would've helped.

We could also have more pg_stat_activity slots than PGPROC slots, or 
just have a few more PGPROC slots than what is required by MaxBackends.

>> For this test, could we perhaps rely on the log messages postmaster logs when
>> child processes exit?
>>
>> 2025-03-04 17:56:12.528 EST [3509838][not initialized][:0][[unknown]] LOG:  connection received: host=[local]
>> 2025-03-04 17:56:12.528 EST [3509838][client backend][:0][[unknown]] FATAL:  sorry, too many clients already
>> 2025-03-04 17:56:12.529 EST [3509817][postmaster][:0][] DEBUG:  releasing pm child slot 2
>> 2025-03-04 17:56:12.529 EST [3509817][postmaster][:0][] DEBUG:  client backend (PID 3509838) exited with exit code 1
>>
>> I.e. the test could wait for the 'client backend exited' message using
>> ->wait_for_log()?
> 
> Matching expected contents in the server logs is a practice I've found
> to be rather reliable, with wait_for_log().  Why not adding an
> injection point with a WARNING or a LOG generated, then check the
> server logs for the code path taken based on the elog() generated with
> the point name?

Hmm, yeah, watching for "releasing pm child slot" or an explicit 
injection point would work.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Re: Refactoring postmaster's code to cleanup after child exit

From

Heikki Linnakangas

Date:

06 March 2025, 23:49:20

In short, all the 4 patches look good to me. Thanks for picking this up!

On 06/03/2025 22:16, Andres Freund wrote:
> On 2025-03-05 20:49:33 -0800, Noah Misch wrote:
>>> This behaviour makes it really hard to debug problems. It'd have been a lot
>>> easier to understand the problem if we'd seen psql's stderr before the test
>>> died.
>>>
>>> I guess that mean at the very least we'd need to put an eval {} around the
>>> ->pump() call., print $self->{stdout}, ->{stderr} and reraise an error?
>>
>> That sounds right.
> 
> In the attached patch I did that for wait_connect().  I did verify that it
> works by implementing the wait_connect() fix before fixing
> 002_connection_limits.pl, which fails if a sleep(1) is added just before the
> proc_exit(1) for FATAL.

+1. For the archives' sake, I just want to clarify that this pump stuff 
is all about getting better error messages on a test failure. It doesn't 
help with the original issue.

This is all annoyingly complicated, but getting good error messages is 
worth it.

> On 2025-03-05 08:23:32 +0900, Michael Paquier wrote:
>> Why not adding an injection point with a WARNING or a LOG generated, then
>> check the server logs for the code path taken based on the elog() generated
>> with the point name?
> 
> I think the log_min_messages approach is a lot simpler. If we need something
> like this more widely we can reconsider injection points...

+1. It's a little annoying to depend on a detail like the "client 
backend process exited" debug message, but seems like the best fix for now.

> I also attached a patch to improve connect_fails()/connect_ok() test names a
> bit. They weren't symmetric and I felt they were lacking in detail for the
> psql return code check.

+1.

While we're at it, attached are a few more cleanups I noticed.

> Another annoying and also funny problem I saw is this failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-06%2009%3A18%3A21
> 2025-03-06 10:42:02.552 UTC [372451][postmaster][:0] LOG:  1800 s is outside the valid range for parameter "authentication_timeout" (1 s .. 600 s)
> 
> I had to increase PG_TEST_TIMEOUT_DEFAULT due to some other test timing out
> when run under valgrind (due to having to insert a lot of rows). But then this
> test runs into the above issue.
> 
> The easiest way seems to be to just limit PG_TEST_TIMEOUT_DEFAULT in this
> test.

LGTM

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

0001-Fix-test-name-and-username-used-in-failed-connection.patch

Re: Refactoring postmaster's code to cleanup after child exit

From

Andres Freund

Date:

07 March 2025, 17:53:16

Hi,

On 2025-03-06 22:49:20 +0200, Heikki Linnakangas wrote:
> In short, all the 4 patches look good to me. Thanks for picking this up!
> 
> On 06/03/2025 22:16, Andres Freund wrote:
> > On 2025-03-05 20:49:33 -0800, Noah Misch wrote:
> > > > This behaviour makes it really hard to debug problems. It'd have been a lot
> > > > easier to understand the problem if we'd seen psql's stderr before the test
> > > > died.
> > > > 
> > > > I guess that mean at the very least we'd need to put an eval {} around the
> > > > ->pump() call., print $self->{stdout}, ->{stderr} and reraise an error?
> > > 
> > > That sounds right.
> > 
> > In the attached patch I did that for wait_connect().  I did verify that it
> > works by implementing the wait_connect() fix before fixing
> > 002_connection_limits.pl, which fails if a sleep(1) is added just before the
> > proc_exit(1) for FATAL.
> 
> +1. For the archives sake, I just want to clarify that this pump stuff is
> all about getting better error messages on a test failure. It doesn't help
> with the original issue.

Agreed.


> This is all annoyingly complicated, but getting good error messages is worth
> it.

Yea. I really look forward to having a way to write stuff like this that
doesn't involve hackily driving psql from 100m away using rubber bands.


> > On 2025-03-05 08:23:32 +0900, Michael Paquier wrote:
> > > Why not adding an injection point with a WARNING or a LOG generated, then
> > > check the server logs for the code path taken based on the elog() generated
> > > with the point name?
> > 
> > I think the log_min_messages approach is a lot simpler. If we need something
> > like this more widely we can reconsider injection points...
> 
> +1. It's a little annoying to depend on a detail like the "client backend
> process exited" debug message, but seems like the best fix for now.

We use the same message for LOG messages too, for other types of backends, so
I think it's not that likely to change.  But still not great.


> While we're at it, attached are a few more cleanups I noticed.

I assume you'll apply that yourself?


Commits with updated commit messages attached.


I wonder if we should apply the polishing of connect_ok()/connect_fails() and
the wait_connect() debuggability improvements to the backbranches? Keeping TAP
infrastructure as similar as possible between branches has proven worthwhile
IME.


Greetings,

Andres Freund

Hi,

On 2025-03-07 18:03:04 +0100, Tomas Vondra wrote:
> On 3/7/25 16:49, Andres Freund wrote:
> > Hi,
> > 
> > On 2025-03-07 16:25:09 +0100, Tomas Vondra wrote:
> >> FWIW I keep running into this (and skink seems unhappy too). I ended up
> >> just adding a sleep(1), right before
> >>
> >> push(@sessions, background_psql_as_user('regress_superuser'));
> >>
> >> and that makes it work on all my machines (including rpi5).
> > 
> > Can you confirm that the fix attached to my prior email suffices to address
> > the issue on your machine too?  I'm planning to push the fixes soon.
> > 
> 
> Yes, the v2 fixes that too.

Cool, thanks for testing.


> I got confused by the message suggesting
> 
>   ... this pump stuff is all about getting better error messages
>   on a test failure. It doesn't help with the original issue.
> 
> which made me believe the tests will still fail, so I haven't tried the
> patches before.

That was just about 0002 (and 0001) neither fixing the race themselves, nor
being required to fix the race. 0002 does make it easier to understand what
went wrong, that's all...

Greetings,

Andres Freund