On Mon, Jan 30, 2023 at 1:53 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> So I did that - same configure options as the buildfarm client, and a
> 'make check' (with only tests up to the 'join' suite, because that's
> where it got stuck before). And it took only ~15 runs (~1h) to hit this
> again on dikkop.
That's good news.
> I managed to collect the fstat/procstat stuff Thomas asked for, and the
> backtraces - attached. I still have the core files, in case we look at
> something. As before, running gcore on the second worker (29081) gets
> this unstuck - it sends some signal that apparently wakes it up.
Thanks! As expected, no bytes in the pipe for any of those processes.
Unfortunately I gave the wrong procstat command; it should be -i, not
-j. Does "procstat -i /path/to/core | grep USR1" show P (pending) for
that stuck process? Silly question really, I don't really expect
poll() to be misbehaving in such a basic way.
I was talking to Andres on IM about this yesterday and he pointed out
a potential out-of-order hazard: WaitEventSetWait() sets "waiting" (to
tell the signal handler to write to the self-pipe) and then reads
latch->is_set with neither a compiler barrier nor a memory barrier in
between. That doesn't seem right: we might see a value of
latch->is_set from before "waiting" was true, and yet the signal
handler might also have run while "waiting" was false, so the
self-pipe doesn't save us, despite the length of the comment about
that. Can you reproduce it with this change?
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1011,6 +1011,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
* ordering, so that we cannot miss seeing is_set if a notification
* has already been queued.
*/
+ pg_memory_barrier();
if (set->latch && set->latch->is_set)
{
occurred_events->fd = PGINVALID_SOCKET;
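To make the hazard described above concrete, here is a toy model of it. This is not PostgreSQL code, just a sketch under simplifying assumptions: it reduces the handshake to four steps (the names W_STORE, W_LOAD, S_STORE, S_LOAD and both functions are invented for illustration) and enumerates every ordering of them. A wakeup counts as lost when the waiter read is_set as false and the signal handler read "waiting" as false, so nobody writes to the self-pipe.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only, not PostgreSQL code.  The four steps are:
 *   W_STORE  waiter sets waiting = true
 *   W_LOAD   waiter reads is_set
 *   S_STORE  setter sets is_set = true
 *   S_LOAD   signal handler reads waiting (writes the self-pipe if true)
 */
enum step { W_STORE, W_LOAD, S_STORE, S_LOAD };

/* Replay one ordering of the four steps; true means the wakeup is lost. */
static bool
lost_wakeup(const enum step order[4])
{
    bool waiting = false;
    bool is_set = false;
    bool waiter_saw_set = false;
    bool handler_saw_waiting = false;

    for (int i = 0; i < 4; i++)
    {
        switch (order[i])
        {
            case W_STORE: waiting = true; break;
            case W_LOAD:  waiter_saw_set = is_set; break;
            case S_STORE: is_set = true; break;
            case S_LOAD:  handler_saw_waiting = waiting; break;
        }
    }

    /* The waiter goes to sleep and nothing was written to the self-pipe. */
    return !waiter_saw_set && !handler_saw_waiting;
}

/* Index of step s within order[]. */
static int
position(const enum step order[4], enum step s)
{
    for (int i = 0; i < 4; i++)
        if (order[i] == s)
            return i;
    assert(!"step missing from ordering");
    return -1;
}

/*
 * Count orderings that lose the wakeup.  The setter's steps always stay in
 * program order.  If waiter_ordered is true, the waiter's store must also
 * precede its load, which is what the barrier is meant to guarantee.
 */
int
count_lost_wakeups(bool waiter_ordered)
{
    int lost = 0;

    for (int a = 0; a < 4; a++)
    for (int b = 0; b < 4; b++)
    for (int c = 0; c < 4; c++)
    for (int d = 0; d < 4; d++)
    {
        if (a == b || a == c || a == d || b == c || b == d || c == d)
            continue;           /* not a permutation of the four steps */

        const enum step order[4] = {
            (enum step) a, (enum step) b, (enum step) c, (enum step) d
        };

        if (position(order, S_STORE) > position(order, S_LOAD))
            continue;           /* setter stays in program order */
        if (waiter_ordered &&
            position(order, W_STORE) > position(order, W_LOAD))
            continue;           /* waiter's store-then-load order enforced */

        if (lost_wakeup(order))
            lost++;
    }
    return lost;
}
```

In this model count_lost_wakeups(true) is 0, while count_lost_wakeups(false) is 1: the single bad ordering is W_LOAD, S_STORE, S_LOAD, W_STORE, i.e. the read of is_set effectively happening before "waiting" becomes true. Whether that reordering is what dikkop is actually hitting is still only a theory.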