On Fri, Jan 27, 2023 at 9:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> > I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14)
> > did not report any results for a couple days, and it seems it got into
> > an infinite loop in REL_11_STABLE when building hash table in a parallel
> > hashjoin, or something like that.
>
> > It seems to be progressing now, probably because I attached gdb to the
> > workers to get backtraces, which does signals etc.
>
> That reminds me of cases that I saw several times on my now-deceased
> animal florican:
>
> https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us
>
> There's clearly something rotten somewhere in there, but whether
> it's our bug or FreeBSD's isn't clear.

And if it's ours, it's possibly in the latch code rather than anything
higher up (I mean, not in condition variables, barriers, or parallel
hash join), because I saw a similar hang in the shm_mq code, which uses
the latch API directly. Note that 13 switched to kqueue but still used
the self-pipe, and 14 switched to a signal event. This hasn't been
reported in those releases or later, which makes the poll() code path
a key suspect.
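
For anyone not familiar with the pattern in question, here is a rough
sketch of a self-pipe wakeup combined with poll(). This is not the
actual latch.c code, just a simplified illustration: all names
(init_self_pipe, wakeup_handler, set_latch, wait_latch) are made up for
this example, there is no shared memory or memory barrier handling, and
the timeout is simply restarted after each wakeup.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical names throughout; a sketch, not PostgreSQL's latch.c. */
int selfpipe_readfd = -1;
int selfpipe_writefd = -1;
volatile sig_atomic_t latch_is_set = 0;
volatile sig_atomic_t waiting = 0;

/* Create the pipe and make both ends non-blocking. */
int init_self_pipe(void)
{
    int fds[2];

    if (pipe(fds) < 0)
        return -1;
    if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0 ||
        fcntl(fds[1], F_SETFL, O_NONBLOCK) < 0)
        return -1;
    selfpipe_readfd = fds[0];
    selfpipe_writefd = fds[1];
    return 0;
}

/* Signal handler: if we are inside wait_latch(), make poll() return. */
void wakeup_handler(int signo)
{
    int save_errno = errno;
    char c = 0;

    (void) signo;
    if (waiting && write(selfpipe_writefd, &c, 1) < 0)
    {
        /* EAGAIN means the pipe is full, so a wakeup is already pending. */
    }
    errno = save_errno;
}

/* Set the latch, then wake its owner (ourselves, or another process). */
void set_latch(pid_t owner)
{
    latch_is_set = 1;
    if (owner == getpid())
        wakeup_handler(0);
    else
        kill(owner, SIGUSR1);   /* owner must have installed wakeup_handler */
}

/* Throw away any wakeup bytes that have accumulated in the pipe. */
void drain_self_pipe(void)
{
    char buf[64];

    while (read(selfpipe_readfd, buf, sizeof(buf)) > 0)
        ;
}

/* Returns true if the latch was set, false on timeout or error. */
bool wait_latch(int timeout_ms)
{
    struct pollfd pfd;
    bool result = false;

    assert(selfpipe_readfd >= 0);
    pfd.fd = selfpipe_readfd;
    pfd.events = POLLIN;

    waiting = 1;                /* must be set before checking the latch */
    for (;;)
    {
        int rc;

        if (latch_is_set)
        {
            result = true;
            break;
        }
        pfd.revents = 0;
        rc = poll(&pfd, 1, timeout_ms);
        if (rc == 0)
            break;              /* timed out */
        if (rc < 0 && errno != EINTR)
            break;              /* unexpected error */
        if (rc > 0)
            drain_self_pipe();  /* woken up; loop to recheck the latch */
    }
    waiting = 0;
    return result;
}
```

The ordering is the delicate part: the waiter has to publish "waiting"
before it checks the latch, and the setter has to set the latch before
it sends the wakeup. If either side gets that wrong (or the wakeup is
lost somewhere below us), the waiter sleeps in poll() until something
else, such as attaching a debugger, interrupts it.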