Thread: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Hi, A recent cfbot run caused CI on windows to crash - on a patch that could not conceivably cause this issue: https://cirrus-ci.com/task/5646021133336576 the patch is just: https://github.com/postgresql-cfbot/postgresql/commit/dbd4afa6e7583c036b86abe2e3d27b508d335c2b regression.diffs: https://api.cirrus-ci.com/v1/artifact/task/5646021133336576/testrun/build/testrun/regress/regress/regression.diffs postmaster.log: https://api.cirrus-ci.com/v1/artifact/task/5646021133336576/testrun/build/testrun/regress/regress/log/postmaster.log crash info: https://api.cirrus-ci.com/v1/artifact/task/5646021133336576/crashlog/crashlog-postgres.exe_1af0_2023-02-08_00-53-23-997.txt 00000085`f03ffa40 00007ff6`fd89faa8 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77] 00000085`f03ffa80 00007ff6`fd6474dc postgres!ExceptionalCondition( char * conditionName = 0x00007ff6`fdd03ca8 "PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED", char * fileName = 0x00007ff6`fdd03c80 "../src/backend/storage/ipc/pmsignal.c", int lineNumber = 0n329)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67] 00000085`f03ffac0 00007ff6`fd676eff postgres!MarkPostmasterChildActive(void)+0x7c [c:\cirrus\src\backend\storage\ipc\pmsignal.c@ 329] 00000085`f03ffb00 00007ff6`fd59aa3a postgres!InitProcess(void)+0x2ef [c:\cirrus\src\backend\storage\lmgr\proc.c @ 375] 00000085`f03ffb60 00007ff6`fd467689 postgres!SubPostmasterMain( int argc = 0n3, char ** argv = 0x000001c6`f3814e80)+0x33a [c:\cirrus\src\backend\postmaster\postmaster.c @ 4962] 00000085`f03ffd90 00007ff6`fda0e1c9 postgres!main( int argc = 0n3, char ** argv = 0x000001c6`f3814e80)+0x2f9 [c:\cirrus\src\backend\main\main.c @ 192] So, somehow we ended up a pmsignal slot for a new backend that's not currently in PM_CHILD_ASSIGNED state. Obviously the first idea is to wonder whether this is a problem introduced as part of the the recent postmaster-latchification work. At first I thought we were failing to terminate running processes, due to the following output: parallel group (20 tests): name char txid text varchar enum float8 regproc int2 boolean bit oid pg_lsn int8 int4 float4uuid rangetypes numeric money boolean ... ok 684 ms char ... ok 517 ms name ... ok 354 ms varchar ... ok 604 ms text ... ok 603 ms int2 ... ok 676 ms int4 ... ok 818 ms int8 ... ok 779 ms oid ... ok 720 ms float4 ... ok 823 ms float8 ... ok 628 ms bit ... ok 666 ms numeric ... ok 1132 ms txid ... ok 497 ms uuid ... ok 818 ms enum ... ok 619 ms money ... FAILED (test process exited with exit code 2) 7337 ms rangetypes ... ok 813 ms pg_lsn ... ok 762 ms regproc ... ok 632 ms But now I realize the reason none of the other tests failed, is because the crash took a long time, presumably due to the debugger creating the above information, so no other tests failed. 2023-02-08 00:53:20.257 GMT client backend[4584] pg_regress/rangetypes STATEMENT: select '-[a,z)'::textrange; TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c",Line: 329, PID: 5948 [ quite a few lines ] 2023-02-08 00:53:27.420 GMT postmaster[872] LOG: server process (PID 5948) was terminated by exception 0xC0000354 2023-02-08 00:53:27.420 GMT postmaster[872] HINT: See C include file "ntstatus.h" for a description of the hexadecimal value. 2023-02-08 00:53:27.420 GMT postmaster[872] LOG: terminating any other active server processes 2023-02-08 00:53:27.434 GMT postmaster[872] LOG: all server processes terminated; reinitializing 2023-02-08 00:53:27.459 GMT startup[5800] LOG: database system was interrupted; last known up at 2023-02-08 00:53:19 GMT 2023-02-08 00:53:27.459 GMT startup[5800] LOG: database system was not properly shut down; automatic recovery in progress 2023-02-08 00:53:27.462 GMT startup[5800] LOG: redo starts at 0/20DCF08 2023-02-08 00:53:27.484 GMT startup[5800] LOG: could not stat file "pg_tblspc/16502": No such file or directory 2023-02-08 00:53:27.484 GMT startup[5800] CONTEXT: WAL redo at 0/20DCFB8 for Tablespace/DROP: 16502 2023-02-08 00:53:27.614 GMT startup[5800] LOG: invalid record length at 0/25353E8: wanted 24, got 0 2023-02-08 00:53:27.614 GMT startup[5800] LOG: redo done at 0/2534FE0 system usage: CPU: user: 0.04 s, system: 0.04 s, elapsed:0.15 s Nevertheless, clearly this should never be reached. Greetings, Andres Freund
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Thomas Munro
Date:
On Wed, Feb 8, 2023 at 2:28 PM Andres Freund <andres@anarazel.de> wrote: > 2023-02-08 00:53:20.257 GMT client backend[4584] pg_regress/rangetypes STATEMENT: select '-[a,z)'::textrange; > TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c",Line: 329, PID: 5948 No idea what's going on yet, but this assertion failure is very familiar to me, as one of the ways that lorikeet fails/failed (though it hasn't failed like that since the postmaster latchification). There it was because Cygwin's signal blocking is unreliable, so the postmaster could start a backend, while already being in the middle of starting a backend. That particular problem shouldn't be possible anymore; now we can only start backends from inside the main event loop. Hmm. (State machine bug? Some confusion about processes caused by the fact that PID was recycled?)
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Thomas Munro
Date:
On Wed, Feb 8, 2023 at 2:52 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Feb 8, 2023 at 2:28 PM Andres Freund <andres@anarazel.de> wrote: > > TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c",Line: 329, PID: 5948 I was wondering if commit 18a4a620 might be relevant, as it touched the management of those slots a few months ago, but then I found a couple of matches from 2021 in my database of build farm assertion failures: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-08-12%2010:38:56 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-09-28%2016:40:49 Fairywren is a Windows + msys2 system, so it doesn't share Cygwin's signal system, it's running the pure Windows code (though it's GCC instead of MSVC and has a different libc, it's using our Windows native code paths and defines WIN32).
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Thomas Munro
Date:
On Wed, Feb 8, 2023 at 4:00 PM Thomas Munro <thomas.munro@gmail.com> wrote: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-08-12%2010:38:56 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-09-28%2016:40:49 These were a bit different though. They also logged "could not reserve shared memory region". And they don't have a user of the same PID logging stuff immediately preceding the failure.
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Thomas Munro
Date:
I still have no theory for how this condition was reached despite a lot of time thinking about it and searching for more clues. As far as I can tell, the recent improvements to postmaster's signal and event handling shouldn't be related: the state management and logic was unchanged. While failing to understand this, I worked[1] on CI log indexing tool with public reports that highlight this sort of thing[2], so I'll be watching out for more evidence. Unfortunately I have no data from before 1 Feb (cfbot previously wasn't interested in the past at all; I'd need to get my hands on the commit IDs for earlier testing but I can't figure out how to get those out of Cirrus or Github -- anyone know how?). FWIW I have a thing I call bfbot for slurping up similar data from the build farm. It's not pretty enough for public consumption, but I do know that this assertion hasn't failed there, except the cases I mentioned earlier, and a load of failures on lorikeet which was completely b0rked until recently. [1] https://xkcd.com/974/ [2] http://cfbot.cputube.org/highlights/assertion-90.html
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Andres Freund
Date:
Hi, On 2023-02-18 13:27:04 +1300, Thomas Munro wrote: > I still have no theory for how this condition was reached despite a > lot of time thinking about it and searching for more clues. As far as > I can tell, the recent improvements to postmaster's signal and event > handling shouldn't be related: the state management and logic was > unchanged. Yea, it's all very odd. If you look at the log: 2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name DETAIL: No valid identifier after ".". 2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name STATEMENT: SELECT parse_ident('xxx.1020'); ... TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c",Line: 329, PID: 5948 abort() has been called ... 2023-02-08 00:53:27.420 GMT postmaster[872] LOG: server process (PID 5948) was terminated by exception 0xC0000354 2023-02-08 00:53:27.420 GMT postmaster[872] HINT: See C include file "ntstatus.h" for a description of the hexadecimal value. 2023-02-08 00:53:27.420 GMT postmaster[872] LOG: terminating any other active server processes 2023-02-08 00:53:27.434 GMT postmaster[872] LOG: all server processes terminated; reinitializing and that it's indeed the money test that failed: money ... FAILED (test process exited with exit code 2) 7337 ms it's very hard to understand how this stack can come to be: 00000085`f03ffa40 00007ff6`fd89faa8 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @ 77] 00000085`f03ffa80 00007ff6`fd6474dc postgres!ExceptionalCondition( char * conditionName = 0x00007ff6`fdd03ca8 "PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED", char * fileName = 0x00007ff6`fdd03c80 "../src/backend/storage/ipc/pmsignal.c", int lineNumber = 0n329)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67] 00000085`f03ffac0 00007ff6`fd676eff postgres!MarkPostmasterChildActive(void)+0x7c [c:\cirrus\src\backend\storage\ipc\pmsignal.c@ 329] 00000085`f03ffb00 00007ff6`fd59aa3a postgres!InitProcess(void)+0x2ef [c:\cirrus\src\backend\storage\lmgr\proc.c @ 375] 00000085`f03ffb60 00007ff6`fd467689 postgres!SubPostmasterMain( int argc = 0n3, char ** argv = 0x000001c6`f3814e80)+0x33a [c:\cirrus\src\backend\postmaster\postmaster.c @ 4962] 00000085`f03ffd90 00007ff6`fda0e1c9 postgres!main( int argc = 0n3, char ** argv = 0x000001c6`f3814e80)+0x2f9 [c:\cirrus\src\backend\main\main.c @ 192] How can a process that we did notify crashing, that has already executed SQL statements, end up in MarkPostmasterChildActive()? > While failing to understand this, I worked[1] on CI log indexing tool > with public reports that highlight this sort of thing[2], so I'll be > watching out for more evidence. Unfortunately I have no data from > before 1 Feb (cfbot previously wasn't interested in the past at all; > I'd need to get my hands on the commit IDs for earlier testing but I > can't figure out how to get those out of Cirrus or Github -- anyone > know how?). FWIW I have a thing I call bfbot for slurping up similar > data from the build farm. It's not pretty enough for public > consumption, but I do know that this assertion hasn't failed there, > except the cases I mentioned earlier, and a load of failures on > lorikeet which was completely b0rked until recently. > [1] https://xkcd.com/974/ > [2] http://cfbot.cputube.org/highlights/assertion-90.html I think this extremely useful. Greetings, Andres Freund
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Justin Pryzby
Date:
On Sat, Feb 18, 2023 at 01:27:04PM +1300, Thomas Munro wrote: > (cfbot previously wasn't interested in the past at all; > I'd need to get my hands on the commit IDs for earlier testing but I > can't figure out how to get those out of Cirrus or Github -- anyone > know how?). I wish I knew - my only suggestion is to scrape it out of "git reflog", but that only works if you configured it to save a huge reflog or saved the historic output of "git reflog". Or if you don't change the branch often, which I imagine doesn't hold true here. -- Justin
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Alexander Lakhin
Date:
Hello, 18.02.2023 04:06, Andres Freund wrote: > Hi, > > On 2023-02-18 13:27:04 +1300, Thomas Munro wrote: >> I still have no theory for how this condition was reached despite a >> lot of time thinking about it and searching for more clues. As far as >> I can tell, the recent improvements to postmaster's signal and event >> handling shouldn't be related: the state management and logic was >> unchanged. > Yea, it's all very odd. > > If you look at the log: > > 2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name DETAIL: No valid identifier after ".". > 2023-02-08 00:53:20.175 GMT client backend[5948] pg_regress/name STATEMENT: SELECT parse_ident('xxx.1020'); > ... > TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c",Line: 329, PID: 5948 > abort() has been called > ... > 2023-02-08 00:53:27.420 GMT postmaster[872] LOG: server process (PID 5948) was terminated by exception 0xC0000354 > 2023-02-08 00:53:27.420 GMT postmaster[872] HINT: See C include file "ntstatus.h" for a description of the hexadecimalvalue. > 2023-02-08 00:53:27.420 GMT postmaster[872] LOG: terminating any other active server processes > 2023-02-08 00:53:27.434 GMT postmaster[872] LOG: all server processes terminated; reinitializing > > > and that it's indeed the money test that failed: > money ... FAILED (test process exited with exit code 2) 7337 ms > > it's very hard to understand how this stack can come to be: > > 00000085`f03ffa40 00007ff6`fd89faa8 ucrtbased!abort(void)+0x5a [minkernel\crts\ucrt\src\appcrt\startup\abort.cpp @77] > 00000085`f03ffa80 00007ff6`fd6474dc postgres!ExceptionalCondition( > char * conditionName = 0x00007ff6`fdd03ca8 "PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED", > char * fileName = 0x00007ff6`fdd03c80 "../src/backend/storage/ipc/pmsignal.c", > int lineNumber = 0n329)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67] > 00000085`f03ffac0 00007ff6`fd676eff postgres!MarkPostmasterChildActive(void)+0x7c [c:\cirrus\src\backend\storage\ipc\pmsignal.c@ 329] > 00000085`f03ffb00 00007ff6`fd59aa3a postgres!InitProcess(void)+0x2ef [c:\cirrus\src\backend\storage\lmgr\proc.c @ 375] > 00000085`f03ffb60 00007ff6`fd467689 postgres!SubPostmasterMain( > int argc = 0n3, > char ** argv = 0x000001c6`f3814e80)+0x33a [c:\cirrus\src\backend\postmaster\postmaster.c @ 4962] > 00000085`f03ffd90 00007ff6`fda0e1c9 postgres!main( > int argc = 0n3, > char ** argv = 0x000001c6`f3814e80)+0x2f9 [c:\cirrus\src\backend\main\main.c @ 192] > > How can a process that we did notify crashing, that has already executed SQL > statements, end up in MarkPostmasterChildActive()? Maybe it's just the backend started for the money test has got the same PID (5948) that the backend for the name test had? A simple script that I've found [1] shows that the pids reused rather often (for me, approximately each 300 process starts in Windows 10 H2), buy maybe under some circumstances (many concurrent processes?) PIDs can coincide even so often to trigger that behavior. [1] https://superuser.com/questions/636497/does-windows-7-reuse-process-ids
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Andres Freund
Date:
Hi, On 2023-02-18 18:00:00 +0300, Alexander Lakhin wrote: > 18.02.2023 04:06, Andres Freund wrote: > > On 2023-02-18 13:27:04 +1300, Thomas Munro wrote: > > How can a process that we did notify crashing, that has already executed > > SQL statements, end up in MarkPostmasterChildActive()? > > Maybe it's just the backend started for the money test has got > the same PID (5948) that the backend for the name test had? I somehow mashed name and money into one test in my head... So forget what I wrote. That doesn't really explain the assertion though. It's too bad that we didn't use doesn't include log_connections/log_disconnections. If nothing else, it makes it a lot easier to identify problems like that. We actually do try to configure it for CI, but it currently doesn't work for pg_regress style tests with meson. Need to fix that. Starting a thread. One thing that made me very suspicious when reading related code is this remark: bool ReleasePostmasterChildSlot(int slot) ... /* * Note: the slot state might already be unused, because the logic in * postmaster.c is such that this might get called twice when a child * crashes. So we don't try to Assert anything about the state. */ That seems fragile, and potentially racy. What if we somehow can end up starting another backend inbetween the two ReleasePostmasterChildSlot() calls, we can end up marking a slot that, newly, has a process associated with it, as inactive? Once the slot has been released the first time, it can be assigned again. ISTM that it's not a good idea that we use PM_CHILD_ASSIGNED to signal both, that a slot has not been used yet, and that it's not in use anymore. I think that makes it quite a bit harder to find state management issues. > A simple script that I've found [1] shows that the pids reused rather often > (for me, approximately each 300 process starts in Windows 10 H2), buy maybe > under some circumstances (many concurrent processes?) PIDs can coincide even > so often to trigger that behavior. It's definitely very aggressive in reusing pids - and it seems to intentionally do work to keep pids small. I wonder if it'd be worth trying to exercise this path aggressively by configuring a very low max pid on linux, in an EXEC_BACKEND build. Greetings, Andres Freund
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Andrew Dunstan
Date:
On 2023-02-17 Fr 19:27, Thomas Munro wrote:
FWIW I have a thing I call bfbot for slurping up similar data from the build farm. It's not pretty enough for public consumption, but I do know that this assertion hasn't failed there, except the cases I mentioned earlier, and a load of failures on lorikeet which was completely b0rked until recently.
Are there things we need to do on the server side to make data extraction easier?
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Thomas Munro
Date:
On Mon, Feb 20, 2023 at 2:46 AM Andrew Dunstan <andrew@dunslane.net> wrote: > On 2023-02-17 Fr 19:27, Thomas Munro wrote: >> FWIW I have a thing I call bfbot for slurping up similar >> data from the build farm. It's not pretty enough for public >> consumption, but I do know that this assertion hasn't failed there, >> except the cases I mentioned earlier, and a load of failures on >> lorikeet which was completely b0rked until recently. > > Are there things we need to do on the server side to make data extraction easier? It's a good question. One thought Andres mentioned to me is whether we might want to have an in-tree tool to find interesting stuff. That is, even locally during development, but also in the CI + buildfarm, a common tool could find and spit out human- and machine-readable highlights (backtraces, PANICs, assertions, ... like cfbot is now doing). Then the knowledge of what's interesting would be maintained and extended by all of us. On the other hand, as we think of new patterns over time to look out for, it's also nice to be able to re-scan old data to see if the new patterns occurred in the past (I've done this several times with cfbot's new highlight analyser as I corrected mistakes and added patterns). So maybe that's also a good idea, but a separate thing. Even if the analyser logic is not in-tree, we could try to make something that works pretty much the same across CI and BF. Perhaps we could think about some of those ideas once the BF is using meson? Aside from having just one system to think about, the meson build system is a bit more structured: it has a more formal concept of test suites and tests with machine readable results from the top level (JSON files etc), with names strictly corresponding to directories where the output is, etc. I think I'd basically want a complete list of available files (= like the artifacts on CI), and then I'd pull down the meson test result file and then decide which other files I also want to pull down (ie stuff relating to failed tests) to analyse. (Not that any of that is intractable with the autoconf or handrolled perl/MSVC stuff, it's just messier, and hard to get motivated when its days are numbered.) One little thing I remembered while looking into this general topic is the noise you get when we crash during pg_regress, which it'd be nice to fix: https://www.postgresql.org/message-id/flat/CA%2BhUKGL7hxqbadkto7e1FCOLQhuHg%3DwVn_PDZd6fDMbQrrZisA%40mail.gmail.com Another topic I'm interested in is how to find useful signals in the timing data. For example, when Nathan and I worked on walreceiver wakeup improvements, we didn't notice that we'd caused some tests to become dramatically slower, because of a pre-existing bug/thinko we hadn't noticed. I want a computer to tell me about this stuff. That's somewhat tricky because of all the noise, but hopefully it's not beyond the powers of statistics to notice that a test unexpectedly took a nap for 10s.
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Alexander Lakhin
Date:
18.02.2023 23:09, Andres Freund wrote: > Hi, > > On 2023-02-18 18:00:00 +0300, Alexander Lakhin wrote: >> 18.02.2023 04:06, Andres Freund wrote: >> Maybe it's just the backend started for the money test has got >> the same PID (5948) that the backend for the name test had? > I somehow mashed name and money into one test in my head... So forget what I > wrote. > > That doesn't really explain the assertion though. > ,,, > It's definitely very aggressive in reusing pids - and it seems to > intentionally do work to keep pids small. I wonder if it'd be worth trying to > exercise this path aggressively by configuring a very low max pid on linux, in > an EXEC_BACKEND build. For information, I performed the simple analysis of the postmaster.log after ordinary `vcregress check`: grep -o -E 'client backend\[[0-9]+\] pg_regress/\w+' ..\..\..\src\test\regress\log\postmaster.log | sort | uniq | wc -l grep -o -E 'client backend\[[0-9]+\] pg_regress' ..\..\..\src\test\regress\log\postmaster.log | sort | uniq | wc -l and got the following numbers: ... iteration 88 196 174 iteration 89 196 170 iteration 90 196 176 iteration 91 196 175 Then I performed some more experiments and came to the conclusion that `vcregress check` is not very suitable for catching duplicate pids. So I've wrote a simple TAP test and caught the condition sought: t/099_check_pids.pl .. # 0 # 1 # 2 # 3 # 4 # 5 # 6 # Thread 15 failed: # Got two equal pids in a row: 3552 on iteration 30 # 1 t/099_check_pids.pl .. 1/1 # Failed test at t/099_check_pids.pl line 107. ... server log contains: ... 2023-02-20 06:33:16.142 PST|[unknown]|[unknown]|5704|63f384ac.1648|LOG: connection received: host=127.0.0.1 port=55134 2023-02-20 06:33:16.142 PST|[unknown]|[unknown]|3552|63f384ac.de0|LOG: connection received: host=127.0.0.1 port=55138 2023-02-20 06:33:16.144 PST|postgres|postgres|5704|63f384ac.1648|LOG: connection authorized: user=postgres database=postgres application_name=099_check_pids.pl 2023-02-20 06:33:16.144 PST|postgres|postgres|3552|63f384ac.de0|LOG: connection authorized: user=postgres database=postgres application_name=099_check_pids.pl 2023-02-20 06:33:16.147 PST|postgres|postgres|5704|63f384ac.1648|LOG: statement: SELECT pg_backend_pid() 2023-02-20 06:33:16.147 PST|postgres|postgres|3552|63f384ac.de0|LOG: statement: SELECT pg_backend_pid() 2023-02-20 06:33:16.147 PST|postgres|postgres|5704|63f384ac.1648|LOG: disconnection: session time: 0:00:00.009 user=postgres database=postgres host=127.0.0.1 port=55134 2023-02-20 06:33:16.147 PST|postgres|postgres|3552|63f384ac.de0|LOG: disconnection: session time: 0:00:00.008 user=postgres database=postgres host=127.0.0.1 port=55138 2023-02-20 06:33:16.158 PST|[unknown]|[unknown]|1672|63f384ac.688|LOG: connection received: host=127.0.0.1 port=55139 ... 2023-02-20 06:33:16.485 PST|postgres|postgres|2748|63f384ac.abc|LOG: connection authorized: user=postgres database=postgres application_name=099_check_pids.pl 2023-02-20 06:33:16.486 PST|[unknown]|[unknown]|3552|63f384ac.de0|LOG: connection received: host=127.0.0.1 port=55164 2023-02-20 06:33:16.487 PST|postgres|postgres|3552|63f384ac.de0|LOG: connection authorized: user=postgres database=postgres application_name=099_check_pids.pl 2023-02-20 06:33:16.488 PST|postgres|postgres|2748|63f384ac.abc|LOG: statement: SELECT pg_backend_pid() 2023-02-20 06:33:16.489 PST|postgres|postgres|2748|63f384ac.abc|LOG: disconnection: session time: 0:00:00.007 user=postgres database=postgres host=127.0.0.1 port=55163 2023-02-20 06:33:16.490 PST|postgres|postgres|3552|63f384ac.de0|LOG: statement: SELECT pg_backend_pid() 2023-02-20 06:33:16.491 PST|postgres|postgres|3552|63f384ac.de0|LOG: disconnection: session time: 0:00:00.008 user=postgres database=postgres host=127.0.0.1 port=55164 2023-02-20 06:33:16.503 PST|[unknown]|[unknown]|244|63f384ac.f4|LOG: connection received: host=127.0.0.1 port=55162 ... (note that even session IDs are the same) Though I've got no that assert yet, thus maybe as you said, the pid duplication is not the (only) condition that triggered it (or maybe something is wrong on my side). Interestingly, but I can get the expected duplicates only if I start the VS prompt and run the test immediately after a logon (or VM reboot). After some activity (or may be some time), the test can run for 30 minutes without success (maybe it depends on OS cache...). Also with a similar TAP test I discovered that on Linux (Debian 11 32-bit) pids are generated sequentially: # pid: 329 # pid: 331 # pid: 333 # pid: 336 # pid: 338 # pid: 340 # pid: 343 # pid: 345 # pid: 349 # pid: 353 So with an extra small max_pid (<360) I just get "could not fork new process for connection: Resource temporarily unavailable'" before two processes, that started one after another, get the same pid. But on Windows the sequence looks random: # pid: 1736 # pid: 8168 # pid: 3764 # pid: 7180 # pid: 3724 # pid: 5372 # pid: 1588 # pid: 4188 # pid: 1404 # pid: 5280 Best regards, Alexander
Attachment
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Thomas Munro
Date:
And again: TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsigna... https://cirrus-ci.com/task/6558324615806976 https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/002_pg_upgrade_old_node.log https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/crashlog/crashlog-postgres.exe_0974_2023-03-11_13-57-27-982.txt
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Alexander Lakhin
Date:
12.03.2023 10:18, Thomas Munro wrote: > And again: > > TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == > PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsigna... > > https://cirrus-ci.com/task/6558324615806976 > https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/002_pg_upgrade_old_node.log > https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/crashlog/crashlog-postgres.exe_0974_2023-03-11_13-57-27-982.txt Here we have duplicate PIDs too: ... 2023-03-11 13:57:21.277 GMT [2152][client backend] [pg_regress/union][:0] LOG: disconnection: session time: 0:00:00.268 user=SYSTEM database=regression host=[local] ... 2023-03-11 13:57:22.320 GMT [4340][client backend] [pg_regress/join][8/947:0] LOG: statement: set enable_hashjoin to 0; TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c", Line: 329, PID: 2152 And I see the following code in postmaster.c: CleanupBackend(int pid, int exitstatus) /* child's exit status. */ { ... dlist_foreach_modify(iter, &BackendList) { Backend *bp = dlist_container(Backend, elem, iter.cur); if (bp->pid == pid) { if (!bp->dead_end) { if (!ReleasePostmasterChildSlot(bp->child_slot)) ... so if a backend with the same PID happened to start (but not reached InitProcess() yet), when CleanBackend() is called to clean after a just finished backend, the slot of the starting one will be released. I am yet to construct a reproduction of the case, but it seems to me that the race condition is not impossible here. Best regards, Alexander
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Andres Freund
Date:
Hi, On 2023-03-13 17:00:00 +0300, Alexander Lakhin wrote: > 12.03.2023 10:18, Thomas Munro wrote: > > And again: > > > > TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == > > PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsigna... > > > > https://cirrus-ci.com/task/6558324615806976 > > https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/002_pg_upgrade_old_node.log > > https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/crashlog/crashlog-postgres.exe_0974_2023-03-11_13-57-27-982.txt > > Here we have duplicate PIDs too: > ... > 2023-03-11 13:57:21.277 GMT [2152][client backend] [pg_regress/union][:0] > LOG: disconnection: session time: 0:00:00.268 user=SYSTEM > database=regression host=[local] > ... > 2023-03-11 13:57:22.320 GMT [4340][client backend] > [pg_regress/join][8/947:0] LOG: statement: set enable_hashjoin to 0; > TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == > PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c", Line: > 329, PID: 2152 > > And I see the following code in postmaster.c: > CleanupBackend(int pid, > int exitstatus) /* child's exit status. */ > { > ... > dlist_foreach_modify(iter, &BackendList) > { > Backend *bp = dlist_container(Backend, elem, iter.cur); > if (bp->pid == pid) > { > if (!bp->dead_end) > { > if (!ReleasePostmasterChildSlot(bp->child_slot)) > ... > > so if a backend with the same PID happened to start (but not reached > InitProcess() yet), when CleanBackend() is called to clean after a just > finished backend, the slot of the starting one will be released. On unix that ought to be unreachable, because we haven't yet reaped the dead process. But I suspect that there currently is no such guarantee on windows. Which seems broken. On windows it looks like pids can't be reused as long as there are handles for the process. Unfortunately, we close the handle for the process in pgwin32_deadchild_callback(), which runs in a separate thread, so the pid can be reused before we get to waitpid(). And thus it can happen while we start new children. I think we need to remove the CloseHandle() from pgwin32_deadchild_callback(). Likely pgwin32_deadchild_callback() shouldn't do anything other than UnregisterWaitEx();PostQueuedCompletionStatus(key=childinfo), pg_queue_signal(), with everything else moved to waitpid(). > I am yet to construct a reproduction of the case, but it seems to me that > the race condition is not impossible here. I suspect the issue could be made much more likely by adding a sleep before the pg_queue_signal(SIGCHLD) in pgwin32_deadchild_callback(). Greetings, Andres Freund
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Thomas Munro
Date:
On Tue, Mar 14, 2023 at 11:20 AM Andres Freund <andres@anarazel.de> wrote: > On windows it looks like pids can't be reused as long as there are handles for > the process. Unfortunately, we close the handle for the process in > pgwin32_deadchild_callback(), which runs in a separate thread, so the pid can > be reused before we get to waitpid(). And thus it can happen while we start > new children. Ahhh. Right, of course. The handle thing makes total sense now that you point it out, and although I couldn't find it in the fine manual, a higher authority has it in black and white[1]. Even without knowing which of those calls is releasing the process table entry, we're doing all of them on the wrong side of that IOCP. Alright, here is a patch to schlep most of that code over into waitpid(), where it belongs. [1] https://devblogs.microsoft.com/oldnewthing/20110107-00/?p=11803
Attachment
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Michael Paquier
Date:
On Tue, Mar 14, 2023 at 01:01:28PM +1300, Thomas Munro wrote: > Ahhh. Right, of course. The handle thing makes total sense now that > you point it out, and although I couldn't find it in the fine manual, > a higher authority has it in black and white[1]. Even without knowing > which of those calls is releasing the process table entry, we're doing > all of them on the wrong side of that IOCP. Alright, here is a patch > to schlep most of that code over into waitpid(), where it belongs. > > [1] https://devblogs.microsoft.com/oldnewthing/20110107-00/?p=11803 I have a small question here.. The emulation of waitpid() for WIN32 is now in postmaster.c. Could it make sense for some of the frontend code to be able to rely on that, as well? -- Michael
Attachment
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Andres Freund
Date:
On 2023-03-14 09:29:56 +0900, Michael Paquier wrote: > The emulation of waitpid() for WIN32 is now in postmaster.c. Could it > make sense for some of the frontend code to be able to rely on that, > as well? Please not as part of this bugfix. It's intricately tied to postmaster.c specific code, as it is.
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Thomas Munro
Date:
Here's a better version with more explicit comments about some details. It passes on CI. Planning to push this soon.
Attachment
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Alexander Lakhin
Date:
Hi, 14.03.2023 01:20, Andres Freund wrote: >> I am yet to construct a reproduction of the case, but it seems to me that >> the race condition is not impossible here. > I suspect the issue could be made much more likely by adding a sleep before > the pg_queue_signal(SIGCHLD) in pgwin32_deadchild_callback(). Thanks for the tip! With pg_usleep(50000) added there, I can reproduce the issue reliably during a minute on average with the 099_check_pids.pl I posted before: ... 2023-03-15 07:26:14.301 GMT|[unknown]|[unknown]|3748|64117316.ea4|LOG: connection received: host=127.0.0.1 port=49902 2023-03-15 07:26:14.302 GMT|postgres|postgres|3748|64117316.ea4|LOG: connection authorized: user=postgres database=postgres application_name=099_check-pids.pl 2023-03-15 07:26:14.304 GMT|postgres|postgres|3748|64117316.ea4|LOG: statement: SELECT pg_backend_pid() 2023-03-15 07:26:14.305 GMT|postgres|postgres|3748|64117316.ea4|LOG: disconnection: session time: 0:00:00.005 user=postgres database=postgres host=127.0.0.1 port=49902 ... 2023-03-15 07:26:25.592 GMT|[unknown]|[unknown]|3748|64117321.ea4|LOG: connection received: host=127.0.0.1 port=50407 TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED"), File: "C:\src\postgresql\src\backend\storage\ipc\pmsignal.c", Line: 329, PID: 3748 abort() has been called2023-03-15 07:26:25.608 GMT|[unknown]|[unknown]|3524|64117321.dc4|LOG: connection received: host=127.0.0.1 port=50408 The result depends on some OS conditions (it reproduced pretty well immediately after VM reboot), but it's enough to test the patch proposed. And I can confirm that the Assert is not observed anymore (with the sleep added after CloseHandle(childinfo->procHandle)). Best regards, Alexander
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Thomas Munro
Date:
On Wed, Mar 15, 2023 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > The result depends on some OS conditions (it reproduced pretty well > immediately after VM reboot), but it's enough to test the patch proposed. > And I can confirm that the Assert is not observed anymore (with the sleep > added after CloseHandle(childinfo->procHandle)). Thanks for confirming. Pushed earlier today. Do you know how it fails in non-assert builds, without the fix?
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Alexander Lakhin
Date:
Hi, 15.03.2023 11:43, Thomas Munro wrote: > On Wed, Mar 15, 2023 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: >> The result depends on some OS conditions (it reproduced pretty well >> immediately after VM reboot), but it's enough to test the patch proposed. >> And I can confirm that the Assert is not observed anymore (with the sleep >> added after CloseHandle(childinfo->procHandle)). > Thanks for confirming. Pushed earlier today. > > Do you know how it fails in non-assert builds, without the fix? I've replaced the Assert with 'if (!...) elog(...)' and got (with a non-assert build): t/099_check-pids.pl .. ok All tests successful. Files=1, Tests=1, 67 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) Result: PASS 2023-03-15 12:22:46.923 GMT|postgres|postgres|4484|6411b896.1184|LOG: !(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED) 2023-03-15 12:22:47.806 GMT|postgres|postgres|4180|6411b897.1054|LOG: !(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED) 2023-03-15 12:23:06.313 GMT|postgres|postgres|4116|6411b8aa.1014|LOG: !(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED) 2023-03-15 12:23:06.374 GMT|postgres|postgres|4740|6411b8aa.1284|LOG: !(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED) 2023-03-15 12:23:25.967 GMT|postgres|postgres|6812|6411b8bd.1a9c|LOG: !(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED) So at least with my test script that doesn't lead to a crash or something. Best regards, Alexander
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Thomas Munro
Date:
On Thu, Mar 16, 2023 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > 15.03.2023 11:43, Thomas Munro wrote: > > Do you know how it fails in non-assert builds, without the fix? > So at least with my test script that doesn't lead to a crash or something. Thanks. We were wondering if the retry mechanism might somehow be hiding this in non-assert builds, but, looking more closely, that is tied specifically to the memory reservation operation. I noticed that d41a178b missed a comment explaining why we used malloc() instead of palloc(), but that isn't true anymore, so here's a small patch to clean that up.
Attachment
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
From
Alexander Lakhin
Date:
16.03.2023 00:13, Thomas Munro wrote: > Thanks. We were wondering if the retry mechanism might somehow be > hiding this in non-assert builds, but, looking more closely, that is > tied specifically to the memory reservation operation. As to hiding, when analyzing the Assert issue, I was wondered if PMSignalShmemInit() can hide an error when calling ShmemInitStruct() for backends? If I understand correctly, backends get PMSignalState through backend_variables, so maybe ShmemInitStruct() should not be executed for backends at all? Best regards, Alexander