Home > mailing lists

Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Date	March 14, 2023 01:20:08
Msg-id	20230313222008.q2q3o6d7fhezdbzr@awork3.anarazel.de Whole thread Raw
In response to	Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED (Alexander Lakhin <exclusion@gmail.com>)
Responses	Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
List	pgsql-hackers

Tree view

Hi,

On 2023-03-13 17:00:00 +0300, Alexander Lakhin wrote:
> 12.03.2023 10:18, Thomas Munro wrote:
> > And again:
> >
> > TRAP: failed Assert("PMSignalState->PMChildFlags[slot] ==
> > PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsigna...
> >
> > https://cirrus-ci.com/task/6558324615806976
> >
https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/testrun/build/testrun/pg_upgrade/002_pg_upgrade/log/002_pg_upgrade_old_node.log
> >
https://api.cirrus-ci.com/v1/artifact/task/6558324615806976/crashlog/crashlog-postgres.exe_0974_2023-03-11_13-57-27-982.txt
>
> Here we have duplicate PIDs too:
> ...
> 2023-03-11 13:57:21.277 GMT [2152][client backend] [pg_regress/union][:0]
> LOG:  disconnection: session time: 0:00:00.268 user=SYSTEM
> database=regression host=[local]
> ...
> 2023-03-11 13:57:22.320 GMT [4340][client backend]
> [pg_regress/join][8/947:0] LOG:  statement: set enable_hashjoin to 0;
> TRAP: failed Assert("PMSignalState->PMChildFlags[slot] ==
> PM_CHILD_ASSIGNED"), File: "../src/backend/storage/ipc/pmsignal.c", Line:
> 329, PID: 2152
>
> And I see the following code in postmaster.c:
> CleanupBackend(int pid,
>                int exitstatus)    /* child's exit status. */
> {
> ...
>     dlist_foreach_modify(iter, &BackendList)
>     {
>         Backend    *bp = dlist_container(Backend, elem, iter.cur);
>         if (bp->pid == pid)
>         {
>             if (!bp->dead_end)
>             {
>                 if (!ReleasePostmasterChildSlot(bp->child_slot))
> ...
>
> so if a backend with the same PID happened to start (but not reached
> InitProcess() yet), when CleanBackend() is called to clean after a just
> finished backend, the slot of the starting one will be released.

On unix that ought to be unreachable, because we haven't yet reaped the dead
process. But I suspect that there currently is no such guarantee on
windows. Which seems broken.

On windows it looks like pids can't be reused as long as there are handles for
the process. Unfortunately, we close the handle for the process in
pgwin32_deadchild_callback(), which runs in a separate thread, so the pid can
be reused before we get to waitpid(). And thus it can happen while we start
new children.

I think we need to remove the CloseHandle() from
pgwin32_deadchild_callback(). Likely pgwin32_deadchild_callback() shouldn't do
anything other than
UnregisterWaitEx();PostQueuedCompletionStatus(key=childinfo),
pg_queue_signal(), with everything else moved to waitpid().

> I am yet to construct a reproduction of the case, but it seems to me that
> the race condition is not impossible here.

I suspect the issue could be made much more likely by adding a sleep before
the pg_queue_signal(SIGCHLD) in pgwin32_deadchild_callback().

Greetings,

Andres Freund

pgsql-hackers by date:

From: Andres Freund
Date: 14 March 2023, 00:59:19
Subject: Re: Microsecond-based timeouts

From: Jacob Champion
Date: 14 March 2023, 01:25:46
Subject: Re: Testing autovacuum wraparound (including failsafe)

Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED - Mailing list pgsql-hackers

Previous

Next