Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Msg-id 60bb34ad-a696-c43d-3f7c-1696796e86ce@gmail.com
In response to Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
Hello Thomas,

31.08.2023 14:15, Thomas Munro wrote:

> We have a signal that is pending and not blocked, so I don't
> immediately know why poll() hasn't returned control.

When I worked at Postgres Pro, we observed a similar lockup under rather
specific conditions: we used an Elbrus CPU and the Elbrus-specific compiler
(lcc), which is based on EDG.
I managed to reproduce that lockup, and Anton Voloshin investigated it.
The issue was caused by a compiler optimization in WaitEventSetWait():
     waiting = true;
...
     while (returned_events == 0)
     {
...
         if (set->latch && set->latch->is_set)
         {
...
             break;
         }

In that case, the compiler decided that it could place the read of
"set->latch->is_set" before the write "waiting = true".
(Placing "pg_compiler_barrier();" just after "waiting = true;" fixed the
issue for us.)
I can't provide more details for now, but maybe you could look at the machine
code generated on the target platform to confirm or reject my guess.
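
To illustrate the reordering and the barrier fix described above, here is a
minimal sketch. It is only a simplified model, not the actual PostgreSQL
code: DemoLatch, demo_waiting, demo_compiler_barrier() and demo_check_latch()
are hypothetical names, and the barrier uses the GCC/Clang empty-asm idiom as
an assumption (other compilers may need a different construct).

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for pg_compiler_barrier(). Assumes a GCC/Clang
 * compatible compiler: the empty asm with a "memory" clobber forbids the
 * compiler from moving memory accesses across this point. It emits no
 * CPU instruction, so it is not a hardware memory barrier.
 */
#define demo_compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical, simplified latch; is_set is deliberately NOT volatile. */
typedef struct DemoLatch
{
    sig_atomic_t is_set;
} DemoLatch;

/* Flag that a signal handler would consult before waking the waiter. */
volatile sig_atomic_t demo_waiting = 0;

/*
 * Announce that we are about to wait, then check the latch.
 * Returns true if the latch was already set at the time of the check.
 */
bool
demo_check_latch(DemoLatch *latch)
{
    /* Sanity check: this sketch is not meant to be re-entered. */
    assert(demo_waiting == 0);

    demo_waiting = 1;

    /*
     * Without this barrier, a compiler may hoist the non-volatile read of
     * latch->is_set above the volatile store to demo_waiting, which is the
     * kind of reordering suspected in the lockup.
     */
    demo_compiler_barrier();

    bool already_set = (latch != NULL && latch->is_set);

    /* In this simplified model we stop "waiting" right after the check. */
    demo_waiting = 0;
    return already_set;
}
```

Comparing the generated assembly of such a function with and without the
barrier line is one way to see whether the load of is_set ends up before the
store to the waiting flag.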

Best regards,
Alexander


