Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Date
Msg-id 4245b0de-9367-c443-8145-7b62b0ab3c41@gmail.com
Whole thread Raw
In response to Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED  (Andres Freund <andres@anarazel.de>)
Responses Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
List pgsql-hackers
18.02.2023 23:09, Andres Freund wrote:
> Hi,
>
> On 2023-02-18 18:00:00 +0300, Alexander Lakhin wrote:
>> 18.02.2023 04:06, Andres Freund wrote:
>> Maybe it's just the backend started for the money test has got
>> the same PID (5948) that the backend for the name test had?
> I somehow mashed name and money into one test in my head... So forget what I
> wrote.
>
> That doesn't really explain the assertion though.
> ,,,
> It's definitely very aggressive in reusing pids - and it seems to
> intentionally do work to keep pids small. I wonder if it'd be worth trying to
> exercise this path aggressively by configuring a very low max pid on linux, in
> an EXEC_BACKEND build.
For information, I performed the simple analysis of
the postmaster.log after ordinary `vcregress check`:
grep -o -E 'client backend\[[0-9]+\] pg_regress/\w+' 
..\..\..\src\test\regress\log\postmaster.log | sort | uniq | wc -l
grep -o -E 'client backend\[[0-9]+\] pg_regress' 
..\..\..\src\test\regress\log\postmaster.log | sort | uniq | wc -l
and got the following numbers:
...
iteration 88
196
174
iteration 89
196
170
iteration 90
196
176
iteration 91
196
175

Then I performed some more experiments and came to the conclusion
that `vcregress check` is not very suitable for catching duplicate pids.
So I've wrote a simple TAP test and caught the condition sought:
t/099_check_pids.pl .. # 0
# 1
# 2
# 3
# 4
# 5
# 6
# Thread 15 failed:
# Got two equal pids in a row: 3552 on iteration 30
# 1

t/099_check_pids.pl .. 1/1 #   Failed test at t/099_check_pids.pl line 107.
...

server log contains:
...
2023-02-20 06:33:16.142 PST|[unknown]|[unknown]|5704|63f384ac.1648|LOG:  
connection received: host=127.0.0.1 port=55134
2023-02-20 06:33:16.142 PST|[unknown]|[unknown]|3552|63f384ac.de0|LOG:  
connection received: host=127.0.0.1 port=55138
2023-02-20 06:33:16.144 PST|postgres|postgres|5704|63f384ac.1648|LOG:  
connection authorized: user=postgres database=postgres 
application_name=099_check_pids.pl
2023-02-20 06:33:16.144 PST|postgres|postgres|3552|63f384ac.de0|LOG:  
connection authorized: user=postgres database=postgres 
application_name=099_check_pids.pl
2023-02-20 06:33:16.147 PST|postgres|postgres|5704|63f384ac.1648|LOG:  
statement: SELECT pg_backend_pid()
2023-02-20 06:33:16.147 PST|postgres|postgres|3552|63f384ac.de0|LOG:  
statement: SELECT pg_backend_pid()
2023-02-20 06:33:16.147 PST|postgres|postgres|5704|63f384ac.1648|LOG:  
disconnection: session time: 0:00:00.009 user=postgres database=postgres 
host=127.0.0.1 port=55134
2023-02-20 06:33:16.147 PST|postgres|postgres|3552|63f384ac.de0|LOG:  
disconnection: session time: 0:00:00.008 user=postgres database=postgres 
host=127.0.0.1 port=55138
2023-02-20 06:33:16.158 PST|[unknown]|[unknown]|1672|63f384ac.688|LOG:  
connection received: host=127.0.0.1 port=55139

...
2023-02-20 06:33:16.485 PST|postgres|postgres|2748|63f384ac.abc|LOG:  
connection authorized: user=postgres database=postgres 
application_name=099_check_pids.pl
2023-02-20 06:33:16.486 PST|[unknown]|[unknown]|3552|63f384ac.de0|LOG:  
connection received: host=127.0.0.1 port=55164
2023-02-20 06:33:16.487 PST|postgres|postgres|3552|63f384ac.de0|LOG:  
connection authorized: user=postgres database=postgres 
application_name=099_check_pids.pl
2023-02-20 06:33:16.488 PST|postgres|postgres|2748|63f384ac.abc|LOG:  
statement: SELECT pg_backend_pid()
2023-02-20 06:33:16.489 PST|postgres|postgres|2748|63f384ac.abc|LOG:  
disconnection: session time: 0:00:00.007 user=postgres database=postgres 
host=127.0.0.1 port=55163
2023-02-20 06:33:16.490 PST|postgres|postgres|3552|63f384ac.de0|LOG:  
statement: SELECT pg_backend_pid()
2023-02-20 06:33:16.491 PST|postgres|postgres|3552|63f384ac.de0|LOG:  
disconnection: session time: 0:00:00.008 user=postgres database=postgres 
host=127.0.0.1 port=55164
2023-02-20 06:33:16.503 PST|[unknown]|[unknown]|244|63f384ac.f4|LOG:  
connection received: host=127.0.0.1 port=55162
...
(note that even session IDs are the same)

Though I've got no that assert yet, thus maybe as you said, the pid 
duplication
is not the (only) condition that triggered it (or maybe something is 
wrong on my side).

Interestingly, but I can get the expected duplicates only if I start the 
VS prompt
and run the test immediately after a logon (or VM reboot).
After some activity (or may be some time), the test can run for 30 minutes
without success (maybe it depends on OS cache...).

Also with a similar TAP test I discovered that on Linux (Debian 11 
32-bit) pids
are generated sequentially:
# pid: 329
# pid: 331
# pid: 333
# pid: 336
# pid: 338
# pid: 340
# pid: 343
# pid: 345
# pid: 349
# pid: 353
So with an extra small max_pid (<360) I just get
"could not fork new process for connection: Resource temporarily 
unavailable'"
before two processes, that started one after another, get the same pid.

But on Windows the sequence looks random:
# pid: 1736
# pid: 8168
# pid: 3764
# pid: 7180
# pid: 3724
# pid: 5372
# pid: 1588
# pid: 4188
# pid: 1404
# pid: 5280

Best regards,
Alexander
Attachment

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Move defaults toward ICU in 16?
Next
From: Peter Eisentraut
Date:
Subject: Re: Improving inferred query column names