lockup in parallel hash join on dikkop (freebsd 14.0-current) - Mailing list pgsql-hackers

From Tomas Vondra
Subject lockup in parallel hash join on dikkop (freebsd 14.0-current)
Msg-id b2bc5c16-899e-ca99-26ed-e623b4259ec7@enterprisedb.com
List pgsql-hackers
Hi,

I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14)
had not reported any results for a couple of days. It seems it got into
an infinite loop in REL_11_STABLE when building the hash table in a parallel
hash join, or something like that.

It seems to be progressing now, probably because I attached gdb to the
workers to get backtraces, which involves signals etc.

Anyway, in 'ps ax' I saw this:

94545  -  Ss       0:03.39 postgres: buildfarm regression [local] SELECT
94627  -  Is       0:00.03 postgres: parallel worker for PID 94545
94628  -  Is       0:00.02 postgres: parallel worker for PID 94545

and the backend was stuck waiting on this query:

    select final > 1 as multibatch
          from hash_join_batches(
        $$
          select count(*) from join_foo
            left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
            on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
        $$);

This started on 2023-01-20 23:23:18.125, and the next log entry (after I did
the gdb stuff) is from 2023-01-26 20:05:16.751. Quite a bit of time.
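
For anyone wanting to check for the same pattern on their own animal, here is a minimal sketch that picks the parallel workers of a given leader out of a 'ps ax' listing. The function name workers_for_leader is made up for illustration, and it assumes the process titles look like the ones in the listing above, which may differ on other branches or platforms.

```shell
# Hypothetical helper: print the PIDs of parallel workers that belong to
# the leader PID given as $1, reading `ps ax`-style output from stdin.
# Assumes process titles of the form
#   "postgres: parallel worker for PID <leader>"
workers_for_leader() {
    leader="$1"
    awk -v leader="$leader" '
        # The worker title ends with the leader PID, so compare the last field.
        /postgres: parallel worker for PID/ && $NF == leader { print $1 }
    '
}
```

Typical use would be `ps ax | workers_for_leader 94545`, and the resulting PIDs are the processes to attach a debugger to.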

It seems all three processes are in WaitEventSetWait, either through
a ConditionVariable or WaitLatch. But I don't have any good idea of
what might have broken - and as it got "unstuck" I can't investigate
further. But I see nodeHash and parallelism are involved, and I recall there
are a lot of gotchas due to how the backends cooperate when building the hash
table, etc. Thomas, any idea what might be wrong?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
