Andres Freund <andres@anarazel.de> writes:
> On 2022-02-26 14:07:05 -0500, Tom Lane wrote:
>> I have observed this three times in the REL_11 branch, once
>> in REL_12, and a couple of times last summer before it occurred
>> to me to start keeping notes. Over that time the machine has
>> been running various patchlevels of FreeBSD 13.0.
> It's certainly interesting that it appears to happen only in the branches
> using poll rather than kqueue to implement latches. That changed between 12
> and 13.

Yeah, and there was no PHJ in v10, so that's a pretty good theory as
to why I've only seen it in those two branches.

> Have you tried running the core regression tests with force_parallel_mode =
> on, or with the parallel costs lowered, to see if that makes the problem
> appear more often?

> The next time this happens / if you still have this open, perhaps it could be
> worth checking if there's a byte in the self pipe?

> Besides trying to make the issue more likely as suggested above, it might be
> worth checking if signalling the stuck processes with SIGUSR1 gets them
> unstuck.
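
For anyone following along, here is a rough sketch of what "a byte in
the self pipe" means under a poll()-based wait loop. This is a
stand-alone illustration, not the code in latch.c; the selfpipe_*
names are made up for the example. The assumption is the usual
self-pipe arrangement: a signal handler writes one byte into a
non-blocking pipe so that a poll() on the read end returns, and a
zero-timeout poll() reports whether such a byte is currently pending.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: create a pipe with both ends non-blocking. */
static int
selfpipe_create(int fds[2])
{
    if (pipe(fds) < 0)
        return -1;
    for (int i = 0; i < 2; i++)
    {
        int flags = fcntl(fds[i], F_GETFL);

        if (flags < 0 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    return 0;
}

/* What a signal handler would do to wake up a poll() on the read end. */
static void
selfpipe_wakeup(int writefd)
{
    char    c = 0;
    int     save_errno = errno;
    ssize_t rc = write(writefd, &c, 1);

    (void) rc;              /* a full pipe (EAGAIN) still means "wake up" */
    errno = save_errno;     /* handlers must not clobber errno */
}

/* Returns 1 if at least one byte is waiting, 0 if none, -1 on error. */
static int
selfpipe_pending(int readfd)
{
    struct pollfd pfd = {.fd = readfd, .events = POLLIN, .revents = 0};
    int     rc;

    do
        rc = poll(&pfd, 1, 0);      /* timeout 0: check without blocking */
    while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Empty the pipe after waking up, so the next poll() can block again. */
static void
selfpipe_drain(int readfd)
{
    char    buf[64];

    while (read(readfd, buf, sizeof(buf)) > 0)
        ;                   /* stops at EAGAIN once the pipe is empty */
}
```

In this toy version, selfpipe_pending() returns 0 on a fresh pipe,
1 after selfpipe_wakeup(), and 0 again after selfpipe_drain().
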
I've now wasted a bunch of kilowatt-hours fruitlessly trying to
reproduce this outside the confines of the buildfarm script.
I'm at a loss to figure out what the buildfarm is doing differently,
but apparently there's something. I'm going to re-enable the
machine's buildfarm job and just wait for it to hang up again.
More info eventually ...

regards, tom lane