Hi,
I received an alert that dikkop (my rpi4 buildfarm animal running
FreeBSD 14) did not report any results for a couple of days, and it
seems it got into an infinite loop in REL_11_STABLE when building the
hash table in a parallel hash join, or something like that.
It seems to be progressing now, probably because I attached gdb to the
workers to get backtraces, which sends signals etc.
Anyway, in 'ps ax' I saw this:
  94545  -  Ss  0:03.39 postgres: buildfarm regression [local] SELECT
  94627  -  Is  0:00.03 postgres: parallel worker for PID 94545
  94628  -  Is  0:00.02 postgres: parallel worker for PID 94545
and the backend was stuck waiting on this query:
  select final > 1 as multibatch
    from hash_join_batches(
  $$
    select count(*) from join_foo
      left join (select b1.id, b1.t from join_bar b1
                 join join_bar b2 using (id)) ss
      on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
  $$);
This started on 2023-01-20 23:23:18.125, and the next log entry (after
I did the gdb stuff) is from 2023-01-26 20:05:16.751. Quite a bit of time.
It seems all three processes are doing WaitEventSetWait, either through
a ConditionVariable, or WaitLatch. But I don't have any good idea of
what might have broken - and as it got "unstuck" I can't investigate
more. But I see nodeHash and parallelism are involved, and I recall
there are a lot of gotchas due to how the backends cooperate when
building the hash table, etc. Thomas, any idea what might be wrong?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company