Re: greenfly lwlock corruption in REL_14_STABLE and REL_15_STABLE - Mailing list pgsql-hackers
| From | Thomas Munro |
|---|---|
| Subject | Re: greenfly lwlock corruption in REL_14_STABLE and REL_15_STABLE |
| Date | |
| Msg-id | CA+hUKGKygkVVviC_=Z8DptsrmS7Dgh8gcHU_aejndAH_Oj6C4w@mail.gmail.com |
| In response to | Re: greenfly lwlock corruption in REL_14_STABLE and REL_15_STABLE ("Greg Burd" <greg@burd.me>) |
| List | pgsql-hackers |
On Fri, Dec 12, 2025 at 6:27 AM Greg Burd <greg@burd.me> wrote:
> On Wed, Dec 10, 2025, at 12:10 AM, Thomas Munro wrote:
> > Beginning a week ago, greenfly (RISC-V, Clang 20.1) has failed like
> > this in 5 of 8 runs of the pgbench tests on the two oldest branches:
>
> Hey Thomas, thanks for raising this. I should more closely monitor my farm animals. As greenfly is one of them and my
> login name is littered in the logs (gburd) I suppose I should dive into this.
Thanks! Since then, each branch has had a non-failing run, so the new
failure stats are 5 out of 10.
> > I will see if I can reproduce it or see something wrong under qemu,
> > but that'll take some time to set up...
>
> It'll take me far less time to reproduce than you. :)
Yeah. I have given up, after a couple of days of grinding through
REL_14_STABLE's src/bin/pgbench tests under qemu, with the same
optimisation level and the clang-20 package from Debian sid +
experimental. I've learned to use an ARM host when I want to see the
weak memory order effects of other RISC-ish architectures via qemu JIT
translation (well, obviously not the *same* weak memory behavior, but
not none at all), so I would have tried that next if you weren't
offering to use the real hardware...
It's a bit confusing that the C code does "oldstate =
pg_atomic_sub_fetch_u32()" (that's the convenience wrapper that
computes and returns the *new* value), but that also makes sense given
how it's used in the following assertion.
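In case it helps anyone following along, here's a minimal stand-alone
C11 sketch of that naming convention (deliberately not our pg_atomic_*
API, so it compiles on its own): the fetch-then-sub form returns the
value *before* the update, while a sub_fetch-style wrapper returns the
value *after* it.

    #include <assert.h>
    #include <stdatomic.h>
    #include <stdio.h>

    int
    main(void)
    {
        atomic_uint state;

        atomic_init(&state, 10);

        /* fetch_sub: returns the old value */
        unsigned int before = atomic_fetch_sub(&state, 3);
        assert(before == 10 && atomic_load(&state) == 7);

        /* sub_fetch style: the new value, i.e. old value minus the delta */
        unsigned int after = atomic_fetch_sub(&state, 3) - 3;
        assert(after == 4 && atomic_load(&state) == 4);

        printf("before=%u after=%u\n", before, after);
        return 0;
    }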
Assuming the state was already corrupted before we got here, I wonder
if you'd learn something by asserting the expected state before the
change:
    if (mode == LW_EXCLUSIVE)
        Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
    else
        Assert(pg_atomic_read_u32(&lock->state) & LW_SHARED_MASK);
(Since we acquired the lock, a relaxed load should be guaranteed to
see a new enough value?)
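To make the placement concrete, this is roughly how I'd expect it to
slot into LWLockRelease() in lwlock.c (paraphrased from memory and
abridged, so the surrounding code may differ slightly from what the
branch actually has):

    /* proposed: assert the expected state *before* the atomic change */
    if (mode == LW_EXCLUSIVE)
        Assert(pg_atomic_read_u32(&lock->state) & LW_VAL_EXCLUSIVE);
    else
        Assert(pg_atomic_read_u32(&lock->state) & LW_SHARED_MASK);

    /* existing release path (abridged) */
    if (mode == LW_EXCLUSIVE)
        oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_EXCLUSIVE);
    else
        oldstate = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);

    /* nobody else can have that kind of lock */
    Assert(!(oldstate & LW_VAL_EXCLUSIVE));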
While trying to imagine how it might have become corrupted, and noting
that the recent change relates to a flag that has some synchronisation
requirement with the non-atomic accesses to the ->lwWaitXXX members
declared in proc.h, I wondered if that atomic-vs-non-atomic fencing
and reordering stuff might be involved, i.e. the waitlist spinlock is
acquired and then we load MyProc->lwWaiting in LWLockDequeueSelf, and
then perhaps corrupt LW_FLAG_HAS_WAITERS, or some other place like
that, but I haven't found that in the code gen yet and I don't have an
actual theory... If you can reproduce this, does anything change if
you mark them volatile?
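By "mark them volatile" I mean something like this in PGPROC
(src/include/storage/proc.h); I'm writing the field names from memory,
so please double-check them against the branch before trying it:

    /* Info about LWLock the process is currently waiting for, if any. */
    volatile bool  lwWaiting;   /* true if waiting for an LW lock */
    volatile uint8 lwWaitMode;  /* lwlock mode being waited for */
    proclist_node  lwWaitLink;  /* position in LW lock wait list */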
Can you please attach the animal's lwlock.o from REL_14_STABLE? I'd
like to check if I'm even looking at the right code...
> I'll see what I can do to find the offending commit(s).
Thanks! It could also be environmental. Maybe the snow arrived and you
adjusted your thermostat? :-) But even then we'd still need to figure
out what's different on those branches... One difference is autoconf
vs Meson. Both say -g -O2, but only the Meson build adds -fPIC. Can you
reproduce it on later branches with configure/make?