Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED |
Date | |
Msg-id | 20230313222008.q2q3o6d7fhezdbzr@awork3.anarazel.de Whole thread Raw |
In response to | Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED (Alexander Lakhin <exclusion@gmail.com>) |
Responses |
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED |
List | pgsql-hackers |
Hi, On 2023-03-13 17:00:00 +0300, Alexander Lakhin wrote: > 12.03.2023 10:18, Thomas Munro wrote: > > And again: > > > > TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == > > PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsigna... > > > > https://cirrus-ci.com/task/6558324615806976 > > https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/002_pg_upgrade_old_node.log > > https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/crashlog/crashlog-postgres.exe_0974_2023-03-11_13-57-27-982.txt > > Here we have duplicate PIDs too: > ... > 2023-03-11 13:57:21.277 GMT [2152][client backend] [pg_regress/union][:0] > LOG: disconnection: session time: 0:00:00.268 user=SYSTEM > database=regression host=[local] > ... > 2023-03-11 13:57:22.320 GMT [4340][client backend] > [pg_regress/join][8/947:0] LOG: statement: set enable_hashjoin to 0; > TRAP: failed Assert("PMSignalState->PMChildFlags[slot] == > PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c", Line: > 329, PID: 2152 > > And I see the following code in postmaster.c: > CleanupBackend(int pid, > int exitstatus) /* child's exit status. */ > { > ... > dlist_foreach_modify(iter, &BackendList) > { > Backend *bp = dlist_container(Backend, elem, iter.cur); > if (bp->pid == pid) > { > if (!bp->dead_end) > { > if (!ReleasePostmasterChildSlot(bp->child_slot)) > ... > > so if a backend with the same PID happened to start (but not reached > InitProcess() yet), when CleanBackend() is called to clean after a just > finished backend, the slot of the starting one will be released. On unix that ought to be unreachable, because we haven't yet reaped the dead process. But I suspect that there currently is no such guarantee on windows. Which seems broken. On windows it looks like pids can't be reused as long as there are handles for the process. Unfortunately, we close the handle for the process in pgwin32_deadchild_callback(), which runs in a separate thread, so the pid can be reused before we get to waitpid(). And thus it can happen while we start new children. I think we need to remove the CloseHandle() from pgwin32_deadchild_callback(). Likely pgwin32_deadchild_callback() shouldn't do anything other than UnregisterWaitEx();PostQueuedCompletionStatus(key=childinfo), pg_queue_signal(), with everything else moved to waitpid(). > I am yet to construct a reproduction of the case, but it seems to me that > the race condition is not impossible here. I suspect the issue could be made much more likely by adding a sleep before the pg_queue_signal(SIGCHLD) in pgwin32_deadchild_callback(). Greetings, Andres Freund
pgsql-hackers by date: