Thread: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Andres Freund
Date:
Hi,

A recent cfbot run caused CI on windows to crash - on a patch that could not
conceivably cause this issue:
  https://cirrus-ci.com/task/5646021133336576
the patch is just:
  https://github.com/postgresql-cfbot/postgresql/commit/dbd4afa6e7583c036b86abe2e3d27b508d335c2b

regression.diffs:
https://api.cirrus-ci.com/v1/artifact/task/5646021133336576/testrun/build/testrun/regress/regress/regression.diffs
postmaster.log:
https://api.cirrus-ci.com/v1/artifact/task/5646021133336576/testrun/build/testrun/regress/regress/log/postmaster.log
crash info:
https://api.cirrus-ci.com/v1/artifact/task/5646021133336576/crashlog/crashlog-postgres.exe_1af0_2023-02-08_00-53-23-997.txt

00000085`f03ffa40 00007ff6`fd89faa8     ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @
77]
00000085`f03ffa80 00007ff6`fd6474dc     postgres!ExceptionalCondition(
            char * conditionName = 0x00007ff6`fdd03ca8 "PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED",
            char * fileName = 0x00007ff6`fdd03c80 "../src/backend/storage/ipc/pmsignal.c",
            int lineNumber = 0n329)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
00000085`f03ffac0 00007ff6`fd676eff     postgres!MarkPostmasterChildActive(void)+0x7c
[c:\cirrus\src\backend\storage\ipc\pmsignal.c@ 329]
 
00000085`f03ffb00 00007ff6`fd59aa3a     postgres!InitProcess(void)+0x2ef [c:\cirrus\src\backend\storage\lmgr\proc.c @
375]
00000085`f03ffb60 00007ff6`fd467689     postgres!SubPostmasterMain(
            int argc = 0n3,
            char ** argv = 0x000001c6`f3814e80)+0x33a [c:\cirrus\src\backend\postmaster\postmaster.c @ 4962]
00000085`f03ffd90 00007ff6`fda0e1c9     postgres!main(
            int argc = 0n3,
            char ** argv = 0x000001c6`f3814e80)+0x2f9 [c:\cirrus\src\backend\main\main.c @ 192]

So, somehow we ended up a pmsignal slot for a new backend that's not currently
in PM_CHILD_ASSIGNED state.


Obviously the first idea is to wonder whether this is a problem introduced as
part of the the recent postmaster-latchification work.


At first I thought we were failing to terminate running processes, due to the
following output:

parallel group (20 tests):  name char txid text varchar enum float8 regproc int2 boolean bit oid pg_lsn int8 int4
float4uuid rangetypes numeric money
 
     boolean                      ... ok          684 ms
     char                         ... ok          517 ms
     name                         ... ok          354 ms
     varchar                      ... ok          604 ms
     text                         ... ok          603 ms
     int2                         ... ok          676 ms
     int4                         ... ok          818 ms
     int8                         ... ok          779 ms
     oid                          ... ok          720 ms
     float4                       ... ok          823 ms
     float8                       ... ok          628 ms
     bit                          ... ok          666 ms
     numeric                      ... ok         1132 ms
     txid                         ... ok          497 ms
     uuid                         ... ok          818 ms
     enum                         ... ok          619 ms
     money                        ... FAILED (test process exited with exit code 2)     7337 ms
     rangetypes                   ... ok          813 ms
     pg_lsn                       ... ok          762 ms
     regproc                      ... ok          632 ms


But now I realize the reason none of the other tests failed, is because the
crash took a long time, presumably due to the debugger creating the above
information, so no other tests failed.


2023-02-08 00:53:20.257 GMT client backend[4584] pg_regress/rangetypes STATEMENT:  select '-[a,z)'::textrange;
TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File:
"../src/backend/storage/ipc/pmsignal.c",Line: 329, PID: 5948
 
[ quite a few lines ]
2023-02-08 00:53:27.420 GMT postmaster[872] LOG:  server process (PID 5948) was terminated by exception 0xC0000354
2023-02-08 00:53:27.420 GMT postmaster[872] HINT:  See C include file "ntstatus.h" for a description of the hexadecimal
value.
2023-02-08 00:53:27.420 GMT postmaster[872] LOG:  terminating any other active server processes
2023-02-08 00:53:27.434 GMT postmaster[872] LOG:  all server processes terminated; reinitializing
2023-02-08 00:53:27.459 GMT startup[5800] LOG:  database system was interrupted; last known up at 2023-02-08 00:53:19
GMT
2023-02-08 00:53:27.459 GMT startup[5800] LOG:  database system was not properly shut down; automatic recovery in
progress
2023-02-08 00:53:27.462 GMT startup[5800] LOG:  redo starts at 0/20DCF08
2023-02-08 00:53:27.484 GMT startup[5800] LOG:  could not stat file "pg_tblspc/16502": No such file or directory
2023-02-08 00:53:27.484 GMT startup[5800] CONTEXT:  WAL redo at 0/20DCFB8 for Tablespace/DROP: 16502
2023-02-08 00:53:27.614 GMT startup[5800] LOG:  invalid record length at 0/25353E8: wanted 24, got 0
2023-02-08 00:53:27.614 GMT startup[5800] LOG:  redo done at 0/2534FE0 system usage: CPU: user: 0.04 s, system: 0.04 s,
elapsed:0.15 s
 


Nevertheless, clearly this should never be reached.

Greetings,

Andres Freund



Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Thomas Munro
Date:
On Wed, Feb 8, 2023 at 2:28 PM Andres Freund <andres@anarazel.de> wrote:
> 2023-02-08 00:53:20.257 GMT client backend[4584] pg_regress/rangetypes STATEMENT:  select '-[a,z)'::textrange;
> TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File:
"../src/backend/storage/ipc/pmsignal.c",Line: 329, PID: 5948
 

No idea what's going on yet, but this assertion failure is very
familiar to me, as one of the ways that lorikeet fails/failed (though
it hasn't failed like that since the postmaster latchification).
There it was because Cygwin's signal blocking is unreliable, so the
postmaster could start a backend, while already being in the middle of
starting a backend.  That particular problem shouldn't be possible
anymore; now we can only start backends from inside the main event
loop.  Hmm.  (State machine bug?  Some confusion about processes
caused by the fact that PID was recycled?)



Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Thomas Munro
Date:
On Wed, Feb 8, 2023 at 2:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Feb 8, 2023 at 2:28 PM Andres Freund <andres@anarazel.de> wrote:
> > TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File:
"../src/backend/storage/ipc/pmsignal.c",Line: 329, PID: 5948
 

I was wondering if commit 18a4a620 might be relevant, as it touched
the management of those slots a few months ago, but then I found a
couple of matches from 2021 in my database of build farm assertion
failures:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-08-12%2010:38:56
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-09-28%2016:40:49

Fairywren is a Windows + msys2 system, so it doesn't share Cygwin's
signal system, it's running the pure Windows code (though it's GCC
instead of MSVC and has a different libc, it's using our Windows
native code paths and defines WIN32).



Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Thomas Munro
Date:
On Wed, Feb 8, 2023 at 4:00 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-08-12%2010:38:56
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-09-28%2016:40:49

These were a bit different though.  They also logged "could not
reserve shared memory region".  And they don't have a user of the same
PID logging stuff immediately preceding the failure.



Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Thomas Munro
Date:
I still have no theory for how this condition was reached despite a
lot of time thinking about it and searching for more clues.  As far as
I can tell, the recent improvements to postmaster's signal and event
handling shouldn't be related: the state management and logic was
unchanged.

While failing to understand this, I worked[1] on CI log indexing tool
with public reports that highlight this sort of thing[2], so I'll be
watching out for more evidence.  Unfortunately I have no data from
before 1 Feb (cfbot previously wasn't interested in the past at all;
I'd need to get my hands on the commit IDs for earlier testing but I
can't figure out how to get those out of Cirrus or Github -- anyone
know how?).  FWIW I have a thing I call bfbot for slurping up similar
data from the build farm.  It's not pretty enough for public
consumption, but I do know that this assertion hasn't failed there,
except the cases I mentioned earlier, and a load of failures on
lorikeet which was completely b0rked until recently.

[1] https://xkcd.com/974/
[2] http://cfbot.cputube.org/highlights/assertion-90.html



Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Andres Freund
Date:
Hi,

On 2023-02-18 13:27:04 +1300, Thomas Munro wrote:
> I still have no theory for how this condition was reached despite a
> lot of time thinking about it and searching for more clues.  As far as
> I can tell, the recent improvements to postmaster's signal and event
> handling shouldn't be related: the state management and logic was
> unchanged.

Yea, it's all very odd.

If you look at the log:

2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name DETAIL:  No valid identifier after ".".
2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name STATEMENT:  SELECT parse_ident('xxx.1020');
...
TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File:
"../src/backend/storage/ipc/pmsignal.c",Line: 329, PID: 5948
 
abort() has been called
...
2023-02-08 00:53:27.420 GMT postmaster[872] LOG:  server process (PID 5948) was terminated by exception 0xC0000354
2023-02-08 00:53:27.420 GMT postmaster[872] HINT:  See C include file "ntstatus.h" for a description of the hexadecimal
value.
2023-02-08 00:53:27.420 GMT postmaster[872] LOG:  terminating any other active server processes
2023-02-08 00:53:27.434 GMT postmaster[872] LOG:  all server processes terminated; reinitializing


and that it's indeed the money test that failed:
     money                        ... FAILED (test process exited with exit code 2)     7337 ms

it's very hard to understand how this stack can come to be:

00000085`f03ffa40 00007ff6`fd89faa8     ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @
77]
00000085`f03ffa80 00007ff6`fd6474dc     postgres!ExceptionalCondition(
            char * conditionName = 0x00007ff6`fdd03ca8 "PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED", 
            char * fileName = 0x00007ff6`fdd03c80 "../src/backend/storage/ipc/pmsignal.c", 
            int lineNumber = 0n329)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
00000085`f03ffac0 00007ff6`fd676eff     postgres!MarkPostmasterChildActive(void)+0x7c
[c:\cirrus\src\backend\storage\ipc\pmsignal.c@ 329]
 
00000085`f03ffb00 00007ff6`fd59aa3a     postgres!InitProcess(void)+0x2ef [c:\cirrus\src\backend\storage\lmgr\proc.c @
375]
00000085`f03ffb60 00007ff6`fd467689     postgres!SubPostmasterMain(
            int argc = 0n3, 
            char ** argv = 0x000001c6`f3814e80)+0x33a [c:\cirrus\src\backend\postmaster\postmaster.c @ 4962]
00000085`f03ffd90 00007ff6`fda0e1c9     postgres!main(
            int argc = 0n3, 
            char ** argv = 0x000001c6`f3814e80)+0x2f9 [c:\cirrus\src\backend\main\main.c @ 192]

How can a process that we did notify crashing, that has already executed SQL
statements, end up in MarkPostmasterChildActive()?



> While failing to understand this, I worked[1] on CI log indexing tool
> with public reports that highlight this sort of thing[2], so I'll be
> watching out for more evidence.  Unfortunately I have no data from
> before 1 Feb (cfbot previously wasn't interested in the past at all;
> I'd need to get my hands on the commit IDs for earlier testing but I
> can't figure out how to get those out of Cirrus or Github -- anyone
> know how?).  FWIW I have a thing I call bfbot for slurping up similar
> data from the build farm.  It's not pretty enough for public
> consumption, but I do know that this assertion hasn't failed there,
> except the cases I mentioned earlier, and a load of failures on
> lorikeet which was completely b0rked until recently.

> [1] https://xkcd.com/974/
> [2] http://cfbot.cputube.org/highlights/assertion-90.html

I think this extremely useful.

Greetings,

Andres Freund



Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Justin Pryzby
Date:
On Sat, Feb 18, 2023 at 01:27:04PM +1300, Thomas Munro wrote:
> (cfbot previously wasn't interested in the past at all;
> I'd need to get my hands on the commit IDs for earlier testing but I
> can't figure out how to get those out of Cirrus or Github -- anyone
> know how?).

I wish I knew - my only suggestion is to scrape it out of "git reflog",
but that only works if you configured it to save a huge reflog or saved
the historic output of "git reflog".  Or if you don't change the branch
often, which I imagine doesn't hold true here.

-- 
Justin



Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Alexander Lakhin
Date:
Hello,
18.02.2023 04:06, Andres Freund wrote:
> Hi,
>
> On 2023-02-18 13:27:04 +1300, Thomas Munro wrote:
>> I still have no theory for how this condition was reached despite a
>> lot of time thinking about it and searching for more clues.  As far as
>> I can tell, the recent improvements to postmaster's signal and event
>> handling shouldn't be related: the state management and logic was
>> unchanged.
> Yea, it's all very odd.
>
> If you look at the log:
>
> 2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name DETAIL:  No valid identifier after ".".
> 2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name STATEMENT:  SELECT parse_ident('xxx.1020');
> ...
> TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File:
"../src/backend/storage/ipc/pmsignal.c",Line: 329, PID: 5948
 
> abort() has been called
> ...
> 2023-02-08 00:53:27.420 GMT postmaster[872] LOG:  server process (PID 5948) was terminated by exception 0xC0000354
> 2023-02-08 00:53:27.420 GMT postmaster[872] HINT:  See C include file "ntstatus.h" for a description of the
hexadecimalvalue.
 
> 2023-02-08 00:53:27.420 GMT postmaster[872] LOG:  terminating any other active server processes
> 2023-02-08 00:53:27.434 GMT postmaster[872] LOG:  all server processes terminated; reinitializing
>
>
> and that it's indeed the money test that failed:
>       money                        ... FAILED (test process exited with exit code 2)     7337 ms
>
> it's very hard to understand how this stack can come to be:
>
> 00000085`f03ffa40 00007ff6`fd89faa8     ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp
@77]
 
> 00000085`f03ffa80 00007ff6`fd6474dc     postgres!ExceptionalCondition(
>             char * conditionName = 0x00007ff6`fdd03ca8 "PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED",
>             char * fileName = 0x00007ff6`fdd03c80 "../src/backend/storage/ipc/pmsignal.c",
>             int lineNumber = 0n329)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
> 00000085`f03ffac0 00007ff6`fd676eff     postgres!MarkPostmasterChildActive(void)+0x7c
[c:\cirrus\src\backend\storage\ipc\pmsignal.c@ 329]
 
> 00000085`f03ffb00 00007ff6`fd59aa3a     postgres!InitProcess(void)+0x2ef [c:\cirrus\src\backend\storage\lmgr\proc.c @
375]
> 00000085`f03ffb60 00007ff6`fd467689     postgres!SubPostmasterMain(
>             int argc = 0n3,
>             char ** argv = 0x000001c6`f3814e80)+0x33a [c:\cirrus\src\backend\postmaster\postmaster.c @ 4962]
> 00000085`f03ffd90 00007ff6`fda0e1c9     postgres!main(
>             int argc = 0n3,
>             char ** argv = 0x000001c6`f3814e80)+0x2f9 [c:\cirrus\src\backend\main\main.c @ 192]
>
> How can a process that we did notify crashing, that has already executed SQL
> statements, end up in MarkPostmasterChildActive()?
Maybe it's just the backend started for the money test has got
the same PID (5948) that the backend for the name test had?
A simple script that I've found [1] shows that the pids reused rather often
(for me, approximately each 300 process starts in Windows 10 H2), buy maybe
under some circumstances (many concurrent processes?) PIDs can coincide even
so often to trigger that behavior.

[1] https://superuser.com/questions/636497/does-windows-7-reuse-process-ids



Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Andres Freund
Date:
Hi,

On 2023-02-18 18:00:00 +0300, Alexander Lakhin wrote:
> 18.02.2023 04:06, Andres Freund wrote:
> > On 2023-02-18 13:27:04 +1300, Thomas Munro wrote:
> > How can a process that we did notify crashing, that has already executed
> > SQL statements, end up in MarkPostmasterChildActive()?
>
> Maybe it's just the backend started for the money test has got
> the same PID (5948) that the backend for the name test had?

I somehow mashed name and money into one test in my head... So forget what I
wrote.

That doesn't really explain the assertion though.


It's too bad that we didn't use doesn't include
log_connections/log_disconnections. If nothing else, it makes it a lot easier
to identify problems like that. We actually do try to configure it for CI, but
it currently doesn't work for pg_regress style tests with meson. Need to fix
that. Starting a thread.



One thing that made me very suspicious when reading related code is this
remark:

bool
ReleasePostmasterChildSlot(int slot)
...
    /*
     * Note: the slot state might already be unused, because the logic in
     * postmaster.c is such that this might get called twice when a child
     * crashes.  So we don't try to Assert anything about the state.
     */

That seems fragile, and potentially racy. What if we somehow can end up
starting another backend inbetween the two ReleasePostmasterChildSlot() calls,
we can end up marking a slot that, newly, has a process associated with it, as
inactive? Once the slot has been released the first time, it can be assigned
again.


ISTM that it's not a good idea that we use PM_CHILD_ASSIGNED to signal both,
that a slot has not been used yet, and that it's not in use anymore. I think
that makes it quite a bit harder to find state management issues.



> A simple script that I've found [1] shows that the pids reused rather often
> (for me, approximately each 300 process starts in Windows 10 H2), buy maybe
> under some circumstances (many concurrent processes?) PIDs can coincide even
> so often to trigger that behavior.

It's definitely very aggressive in reusing pids - and it seems to
intentionally do work to keep pids small. I wonder if it'd be worth trying to
exercise this path aggressively by configuring a very low max pid on linux, in
an EXEC_BACKEND build.

Greetings,

Andres Freund



Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Andrew Dunstan
Date:


On 2023-02-17 Fr 19:27, Thomas Munro wrote:
FWIW I have a thing I call bfbot for slurping up similar
data from the build farm.  It's not pretty enough for public
consumption, but I do know that this assertion hasn't failed there,
except the cases I mentioned earlier, and a load of failures on
lorikeet which was completely b0rked until recently.


Are there things we need to do on the server side to make data extraction easier?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Thomas Munro
Date:
On Mon, Feb 20, 2023 at 2:46 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> On 2023-02-17 Fr 19:27, Thomas Munro wrote:
>> FWIW I have a thing I call bfbot for slurping up similar
>> data from the build farm.  It's not pretty enough for public
>> consumption, but I do know that this assertion hasn't failed there,
>> except the cases I mentioned earlier, and a load of failures on
>> lorikeet which was completely b0rked until recently.
>
> Are there things we need to do on the server side to make data extraction easier?

It's a good question.

One thought Andres mentioned to me is whether we might want to have an
in-tree tool to find interesting stuff.  That is, even locally during
development, but also in the CI + buildfarm, a common tool could find
and spit out human- and machine-readable highlights (backtraces,
PANICs, assertions, ... like cfbot is now doing).  Then the knowledge
of what's interesting would be maintained and extended by all of us.

On the other hand, as we think of new patterns over time to look out
for, it's also nice to be able to re-scan old data to see if the new
patterns occurred in the past (I've done this several times with
cfbot's new highlight analyser as I corrected mistakes and added
patterns).  So maybe that's also a good idea, but a separate thing.
Even if the analyser logic is not in-tree, we could try to make
something that works pretty much the same across CI and BF.  Perhaps
we could think about some of those ideas once the BF is using meson?
Aside from having just one system to think about, the meson build
system is a bit more structured: it has a more formal concept of test
suites and tests with machine readable results from the top level
(JSON files etc), with names strictly corresponding to directories
where the output is, etc.  I think I'd basically want a complete list
of available files (= like the artifacts on CI), and then I'd pull
down the meson test result file and then decide which other files I
also want to pull down (ie stuff relating to failed tests) to analyse.
(Not that any of that is intractable with the autoconf or handrolled
perl/MSVC stuff, it's just messier, and hard to get motivated when its
days are numbered.)

One little thing I remembered while looking into this general topic is
the noise you get when we crash during pg_regress, which it'd be nice
to fix:

https://www.postgresql.org/message-id/flat/CA%2BhUKGL7hxqbadkto7e1FCOLQhuHg%3DwVn_PDZd6fDMbQrrZisA%40mail.gmail.com

Another topic I'm interested in is how to find useful signals in the
timing data.  For example, when Nathan and I worked on walreceiver
wakeup improvements, we didn't notice that we'd caused some tests to
become dramatically slower, because of a pre-existing bug/thinko we
hadn't noticed.  I want a computer to tell me about this stuff.
That's somewhat tricky because of all the noise, but hopefully it's
not beyond the powers of statistics to notice that a test unexpectedly
took a nap for 10s.



Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Alexander Lakhin
Date:
18.02.2023 23:09, Andres Freund wrote:
> Hi,
>
> On 2023-02-18 18:00:00 +0300, Alexander Lakhin wrote:
>> 18.02.2023 04:06, Andres Freund wrote:
>> Maybe it's just the backend started for the money test has got
>> the same PID (5948) that the backend for the name test had?
> I somehow mashed name and money into one test in my head... So forget what I
> wrote.
>
> That doesn't really explain the assertion though.
> ,,,
> It's definitely very aggressive in reusing pids - and it seems to
> intentionally do work to keep pids small. I wonder if it'd be worth trying to
> exercise this path aggressively by configuring a very low max pid on linux, in
> an EXEC_BACKEND build.
For information, I performed the simple analysis of
the postmaster.log after ordinary `vcregress check`:
grep -o -E 'client backend\[[0-9]+\] pg_regress/\w+' 
..\..\..\src\test\regress\log\postmaster.log | sort | uniq | wc -l
grep -o -E 'client backend\[[0-9]+\] pg_regress' 
..\..\..\src\test\regress\log\postmaster.log | sort | uniq | wc -l
and got the following numbers:
...
iteration 88
196
174
iteration 89
196
170
iteration 90
196
176
iteration 91
196
175

Then I performed some more experiments and came to the conclusion
that `vcregress check` is not very suitable for catching duplicate pids.
So I've wrote a simple TAP test and caught the condition sought:
t/099_check_pids.pl .. # 0
# 1
# 2
# 3
# 4
# 5
# 6
# Thread 15 failed:
# Got two equal pids in a row: 3552 on iteration 30
# 1

t/099_check_pids.pl .. 1/1 #   Failed test at t/099_check_pids.pl line 107.
...

server log contains:
...
2023-02-20 06:33:16.142 PST|[unknown]|[unknown]|5704|63f384ac.1648|LOG:  
connection received: host=127.0.0.1 port=55134
2023-02-20 06:33:16.142 PST|[unknown]|[unknown]|3552|63f384ac.de0|LOG:  
connection received: host=127.0.0.1 port=55138
2023-02-20 06:33:16.144 PST|postgres|postgres|5704|63f384ac.1648|LOG:  
connection authorized: user=postgres database=postgres 
application_name=099_check_pids.pl
2023-02-20 06:33:16.144 PST|postgres|postgres|3552|63f384ac.de0|LOG:  
connection authorized: user=postgres database=postgres 
application_name=099_check_pids.pl
2023-02-20 06:33:16.147 PST|postgres|postgres|5704|63f384ac.1648|LOG:  
statement: SELECT pg_backend_pid()
2023-02-20 06:33:16.147 PST|postgres|postgres|3552|63f384ac.de0|LOG:  
statement: SELECT pg_backend_pid()
2023-02-20 06:33:16.147 PST|postgres|postgres|5704|63f384ac.1648|LOG:  
disconnection: session time: 0:00:00.009 user=postgres database=postgres 
host=127.0.0.1 port=55134
2023-02-20 06:33:16.147 PST|postgres|postgres|3552|63f384ac.de0|LOG:  
disconnection: session time: 0:00:00.008 user=postgres database=postgres 
host=127.0.0.1 port=55138
2023-02-20 06:33:16.158 PST|[unknown]|[unknown]|1672|63f384ac.688|LOG:  
connection received: host=127.0.0.1 port=55139

...
2023-02-20 06:33:16.485 PST|postgres|postgres|2748|63f384ac.abc|LOG:  
connection authorized: user=postgres database=postgres 
application_name=099_check_pids.pl
2023-02-20 06:33:16.486 PST|[unknown]|[unknown]|3552|63f384ac.de0|LOG:  
connection received: host=127.0.0.1 port=55164
2023-02-20 06:33:16.487 PST|postgres|postgres|3552|63f384ac.de0|LOG:  
connection authorized: user=postgres database=postgres 
application_name=099_check_pids.pl
2023-02-20 06:33:16.488 PST|postgres|postgres|2748|63f384ac.abc|LOG:  
statement: SELECT pg_backend_pid()
2023-02-20 06:33:16.489 PST|postgres|postgres|2748|63f384ac.abc|LOG:  
disconnection: session time: 0:00:00.007 user=postgres database=postgres 
host=127.0.0.1 port=55163
2023-02-20 06:33:16.490 PST|postgres|postgres|3552|63f384ac.de0|LOG:  
statement: SELECT pg_backend_pid()
2023-02-20 06:33:16.491 PST|postgres|postgres|3552|63f384ac.de0|LOG:  
disconnection: session time: 0:00:00.008 user=postgres database=postgres 
host=127.0.0.1 port=55164
2023-02-20 06:33:16.503 PST|[unknown]|[unknown]|244|63f384ac.f4|LOG:  
connection received: host=127.0.0.1 port=55162
...
(note that even session IDs are the same)

Though I've got no that assert yet, thus maybe as you said, the pid 
duplication
is not the (only) condition that triggered it (or maybe something is 
wrong on my side).

Interestingly, but I can get the expected duplicates only if I start the 
VS prompt
and run the test immediately after a logon (or VM reboot).
After some activity (or may be some time), the test can run for 30 minutes
without success (maybe it depends on OS cache...).

Also with a similar TAP test I discovered that on Linux (Debian 11 
32-bit) pids
are generated sequentially:
# pid: 329
# pid: 331
# pid: 333
# pid: 336
# pid: 338
# pid: 340
# pid: 343
# pid: 345
# pid: 349
# pid: 353
So with an extra small max_pid (<360) I just get
"could not fork new process for connection: Resource temporarily 
unavailable'"
before two processes, that started one after another, get the same pid.

But on Windows the sequence looks random:
# pid: 1736
# pid: 8168
# pid: 3764
# pid: 7180
# pid: 3724
# pid: 5372
# pid: 1588
# pid: 4188
# pid: 1404
# pid: 5280

Best regards,
Alexander
Attachment

Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Alexander Lakhin
Date:
12.03.2023 10:18, Thomas Munro wrote:
> And again:
>
> TRAP: failed Assert("PMSignalState->PMChildFlags[slot] ==
> PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsigna...
>
> https://cirrus-ci.com/task/6558324615806976
>
https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/002_pg_upgrade_old_node.log
>
https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/crashlog/crashlog-postgres.exe_0974_2023-03-11_13-57-27-982.txt

Here we have duplicate PIDs too:
...
2023-03-11 13:57:21.277 GMT [2152][client backend] [pg_regress/union][:0] LOG:  
disconnection: session time: 0:00:00.268 user=SYSTEM database=regression 
host=[local]
...
2023-03-11 13:57:22.320 GMT [4340][client backend] [pg_regress/join][8/947:0] 
LOG:  statement: set enable_hashjoin to 0;
TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), 
File: "../src/backend/storage/ipc/pmsignal.c", Line: 329, PID: 2152

And I see the following code in postmaster.c:
CleanupBackend(int pid,
                int exitstatus)    /* child's exit status. */
{
...
     dlist_foreach_modify(iter, &BackendList)
     {
         Backend    *bp = dlist_container(Backend, elem, iter.cur);
         if (bp->pid == pid)
         {
             if (!bp->dead_end)
             {
                 if (!ReleasePostmasterChildSlot(bp->child_slot))
...

so if a backend with the same PID happened to start (but not reached
InitProcess() yet), when CleanBackend() is called to clean after a just
finished backend, the slot of the starting one will be released.

I am yet to construct a reproduction of the case, but it seems to me that
the race condition is not impossible here.

Best regards,
Alexander



Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Andres Freund
Date:
Hi,

On 2023-03-13 17:00:00 +0300, Alexander Lakhin wrote:
> 12.03.2023 10:18, Thomas Munro wrote:
> > And again:
> >
> > TRAP: failed Assert("PMSignalState->PMChildFlags[slot] ==
> > PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsigna...
> >
> > https://cirrus-ci.com/task/6558324615806976
> >
https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/002_pg_upgrade_old_node.log
> >
https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/crashlog/crashlog-postgres.exe_0974_2023-03-11_13-57-27-982.txt
>
> Here we have duplicate PIDs too:
> ...
> 2023-03-11 13:57:21.277 GMT [2152][client backend] [pg_regress/union][:0]
> LOG:  disconnection: session time: 0:00:00.268 user=SYSTEM
> database=regression host=[local]
> ...
> 2023-03-11 13:57:22.320 GMT [4340][client backend]
> [pg_regress/join][8/947:0] LOG:  statement: set enable_hashjoin to 0;
> TRAP: failed Assert("PMSignalState->PMChildFlags[slot] ==
> PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c", Line:
> 329, PID: 2152
>
> And I see the following code in postmaster.c:
> CleanupBackend(int pid,
>                int exitstatus)    /* child's exit status. */
> {
> ...
>     dlist_foreach_modify(iter, &BackendList)
>     {
>         Backend    *bp = dlist_container(Backend, elem, iter.cur);
>         if (bp->pid == pid)
>         {
>             if (!bp->dead_end)
>             {
>                 if (!ReleasePostmasterChildSlot(bp->child_slot))
> ...
>
> so if a backend with the same PID happened to start (but not reached
> InitProcess() yet), when CleanBackend() is called to clean after a just
> finished backend, the slot of the starting one will be released.

On unix that ought to be unreachable, because we haven't yet reaped the dead
process. But I suspect that there currently is no such guarantee on
windows. Which seems broken.

On windows it looks like pids can't be reused as long as there are handles for
the process. Unfortunately, we close the handle for the process in
pgwin32_deadchild_callback(), which runs in a separate thread, so the pid can
be reused before we get to waitpid(). And thus it can happen while we start
new children.

I think we need to remove the CloseHandle() from
pgwin32_deadchild_callback(). Likely pgwin32_deadchild_callback() shouldn't do
anything other than
UnregisterWaitEx();PostQueuedCompletionStatus(key=childinfo),
pg_queue_signal(), with everything else moved to waitpid().


> I am yet to construct a reproduction of the case, but it seems to me that
> the race condition is not impossible here.

I suspect the issue could be made much more likely by adding a sleep before
the pg_queue_signal(SIGCHLD) in pgwin32_deadchild_callback().

Greetings,

Andres Freund



On Tue, Mar 14, 2023 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
> On windows it looks like pids can't be reused as long as there are handles for
> the process. Unfortunately, we close the handle for the process in
> pgwin32_deadchild_callback(), which runs in a separate thread, so the pid can
> be reused before we get to waitpid(). And thus it can happen while we start
> new children.

Ahhh.  Right, of course.  The handle thing makes total sense now that
you point it out, and although I couldn't find it in the fine manual,
a higher authority has it in black and white[1].  Even without knowing
which of those calls is releasing the process table entry, we're doing
all of them on the wrong side of that IOCP.  Alright, here is a patch
to schlep most of that code over into waitpid(), where it belongs.

[1] https://devblogs.microsoft.com/oldnewthing/20110107-00/?p=11803

Attachment

Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Michael Paquier
Date:
On Tue, Mar 14, 2023 at 01:01:28PM +1300, Thomas Munro wrote:
> Ahhh.  Right, of course.  The handle thing makes total sense now that
> you point it out, and although I couldn't find it in the fine manual,
> a higher authority has it in black and white[1].  Even without knowing
> which of those calls is releasing the process table entry, we're doing
> all of them on the wrong side of that IOCP.  Alright, here is a patch
> to schlep most of that code over into waitpid(), where it belongs.
>
> [1] https://devblogs.microsoft.com/oldnewthing/20110107-00/?p=11803

I have a small question here..

The emulation of waitpid() for WIN32 is now in postmaster.c.  Could it
make sense for some of the frontend code to be able to rely on that,
as well?
--
Michael

Attachment

Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Andres Freund
Date:
On 2023-03-14 09:29:56 +0900, Michael Paquier wrote:
> The emulation of waitpid() for WIN32 is now in postmaster.c.  Could it
> make sense for some of the frontend code to be able to rely on that,
> as well?

Please not as part of this bugfix. It's intricately tied to postmaster.c
specific code, as it is.



Here's a better version with more explicit comments about some
details.  It passes on CI.  Planning to push this soon.

Attachment

Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Alexander Lakhin
Date:
Hi,
14.03.2023 01:20, Andres Freund wrote:
>> I am yet to construct a reproduction of the case, but it seems to me that
>> the race condition is not impossible here.
> I suspect the issue could be made much more likely by adding a sleep before
> the pg_queue_signal(SIGCHLD) in pgwin32_deadchild_callback().

Thanks for the tip! With pg_usleep(50000) added there, I can reproduce the issue
reliably during a minute on average with the 099_check_pids.pl I posted before:
...
2023-03-15 07:26:14.301 GMT|[unknown]|[unknown]|3748|64117316.ea4|LOG:  
connection received: host=127.0.0.1 port=49902
2023-03-15 07:26:14.302 GMT|postgres|postgres|3748|64117316.ea4|LOG:  connection 
authorized: user=postgres database=postgres application_name=099_check-pids.pl
2023-03-15 07:26:14.304 GMT|postgres|postgres|3748|64117316.ea4|LOG:  statement: 
SELECT pg_backend_pid()
2023-03-15 07:26:14.305 GMT|postgres|postgres|3748|64117316.ea4|LOG:  
disconnection: session time: 0:00:00.005 user=postgres database=postgres 
host=127.0.0.1 port=49902
...
2023-03-15 07:26:25.592 GMT|[unknown]|[unknown]|3748|64117321.ea4|LOG:  
connection received: host=127.0.0.1 port=50407
TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), 
File: "C:\src\postgresql\src\backend\storage\ipc\pmsignal.c", Line: 329, PID: 3748
abort() has been called2023-03-15 07:26:25.608 
GMT|[unknown]|[unknown]|3524|64117321.dc4|LOG:  connection received: 
host=127.0.0.1 port=50408

The result depends on some OS conditions (it reproduced pretty well
immediately after VM reboot), but it's enough to test the patch proposed.
And I can confirm that the Assert is not observed anymore (with the sleep
added after CloseHandle(childinfo->procHandle)).

Best regards,
Alexander



On Wed, Mar 15, 2023 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> The result depends on some OS conditions (it reproduced pretty well
> immediately after VM reboot), but it's enough to test the patch proposed.
> And I can confirm that the Assert is not observed anymore (with the sleep
> added after CloseHandle(childinfo->procHandle)).

Thanks for confirming.  Pushed earlier today.

Do you know how it fails in non-assert builds, without the fix?



Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Alexander Lakhin
Date:
Hi,
15.03.2023 11:43, Thomas Munro wrote:
> On Wed, Mar 15, 2023 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>> The result depends on some OS conditions (it reproduced pretty well
>> immediately after VM reboot), but it's enough to test the patch proposed.
>> And I can confirm that the Assert is not observed anymore (with the sleep
>> added after CloseHandle(childinfo->procHandle)).
> Thanks for confirming.  Pushed earlier today.
>
> Do you know how it fails in non-assert builds, without the fix?

I've replaced the Assert with 'if (!...) elog(...)' and got (with a non-assert 
build):
t/099_check-pids.pl .. ok
All tests successful.
Files=1, Tests=1, 67 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU)
Result: PASS
2023-03-15 12:22:46.923 GMT|postgres|postgres|4484|6411b896.1184|LOG: 
!(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED)
2023-03-15 12:22:47.806 GMT|postgres|postgres|4180|6411b897.1054|LOG: 
!(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED)
2023-03-15 12:23:06.313 GMT|postgres|postgres|4116|6411b8aa.1014|LOG: 
!(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED)
2023-03-15 12:23:06.374 GMT|postgres|postgres|4740|6411b8aa.1284|LOG: 
!(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED)
2023-03-15 12:23:25.967 GMT|postgres|postgres|6812|6411b8bd.1a9c|LOG: 
!(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED)

So at least with my test script that doesn't lead to a crash or something.

Best regards,
Alexander



On Thu, Mar 16, 2023 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> 15.03.2023 11:43, Thomas Munro wrote:
> > Do you know how it fails in non-assert builds, without the fix?

> So at least with my test script that doesn't lead to a crash or something.

Thanks.  We were wondering if the retry mechanism might somehow be
hiding this in non-assert builds, but, looking more closely, that is
tied specifically to the memory reservation operation.

I noticed that d41a178b missed a comment explaining why we used
malloc() instead of palloc(), but that isn't true anymore, so here's a
small patch to clean that up.

Attachment

Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED

From
Alexander Lakhin
Date:
16.03.2023 00:13, Thomas Munro wrote:
> Thanks.  We were wondering if the retry mechanism might somehow be
> hiding this in non-assert builds, but, looking more closely, that is
> tied specifically to the memory reservation operation.

As to hiding, when analyzing the Assert issue, I was wondered if 
PMSignalShmemInit() can hide an error when calling ShmemInitStruct() for backends?
If I understand correctly, backends get PMSignalState through backend_variables, 
so maybe ShmemInitStruct() should not be executed for backends at all?

Best regards,
Alexander