Re: Refactoring postmaster's code to cleanup after child exit - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Refactoring postmaster's code to cleanup after child exit
Date
Msg-id 574340ca-c90e-4086-9ae2-cd2f1359e989@iki.fi
Whole thread Raw
In response to Refactoring postmaster's code to cleanup after child exit  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Refactoring postmaster's code to cleanup after child exit
List pgsql-hackers
On 09/12/2024 14:47, Tomas Vondra wrote:
> On 12/9/24 13:30, Heikki Linnakangas wrote:
>> On 09/12/2024 01:12, Tomas Vondra wrote:
>>> On 11/14/24 15:13, Heikki Linnakangas wrote:
>>>> On 09/10/2024 23:40, Heikki Linnakangas wrote:
>>>>> I pushed the first three patches, with the new test and one of the
>>>>> small
>>>>> refactoring patches. Thanks for all the comments so far! Here is a new
>>>>> version of the remaining patches.
>>>>>
>>> Hi, the TAP test 001_connection_limits.pl introduced by 6a1d0d470e84
>>> seems to have problems with valgrind :-( I reliably get this failure:
>>
>> How exactly do you run the test with valgrind? What platform?
>>
> 
> It failed for me on both amd64 (Fedora 41) and rpi5 32/64-bit (Debian).
> 
>> It works for me, with this:
>>
>> (cd build && ninja && rm -rf tmp_install && meson test --suite setup &&
>> valgrind --leak-check=no --gen-suppressions=all --suppressions=/home/
>> heikki/git-sandbox/postgresql/src/tools/valgrind.supp --time-stamp=yes
>> --error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END --log-file=$HOME/
>> pg-valgrind/%p.log --trace-children=yes meson test --suite postmaster )
>>
> 
> I have a patch that tweaks pg_ctl/pg_regress to execute valgrind, so I
> just do
> 
> ./configure --enable-debug --prefix=/home/user/builds/master
> --enable-depend --enable-cassert --enable-tap-tests CPPFLAGS="-O0 -ggdb3
> -DUSE_VALGRIND"
> 
> and then the usual "make check" or whatever.
> 
> The patch has a hardcoded path to the .supp file, and places the
> valgrind log into /tmp. It has worked for me fine up until that commit,
> and it still seems to be working in every other test directory.

Ok, I was able to reproduce this with that setup.

Unsurprisingly, it's a timing issue. It can be reproduced without 
valgrind by adding this delay:

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 289059435a9..1eb6bad72ca 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -583,6 +583,7 @@ errfinish(const char *filename, int lineno, const 
char *funcname)
           * FATAL termination.  The postmaster may or may not consider this
           * worthy of panic, depending on which subprocess returns it.
           */
+        sleep(1);
          proc_exit(1);
      }

The test opens a connection that is expected to fail with the "remaining 
connection slots are reserved for roles with the SUPERUSER attribute" 
error. Right after that, it opens a new connection as superuser, and 
expects it to succeed. But if the previous backend hasn't exited yet, 
the new connection fails with "too many clients already".

Not sure how to fix this. A small sleep in the test would work, but in 
principle there's no delay that's guaranteed to be enough. A more robust 
solution would be to run a "select count(*) from pg_stat_activity" and 
wait until the number of connections are what's expected. I'll try that 
and see how complicated that gets..

-- 
Heikki Linnakangas
Neon (https://neon.tech)



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Memory leak in WAL sender with pgoutput (v10~)
Next
From: Andres Freund
Date:
Subject: Managing IO workers in postmaster's state machine