Re: BUG #16990: Random PANIC in qemu user context - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #16990: Random PANIC in qemu user context
Date 2021-05-02 18:33:49 +0000
Msg-id 3714052.1619980429@sss.pgh.pa.us
In response to BUG #16990: Random PANIC in qemu user context  (PG Bug reporting form <noreply@postgresql.org>)
Responses Re: BUG #16990: Random PANIC in qemu user context  (Paul Guyot <pguyot@kallisys.net>)
List pgsql-bugs
PG Bug reporting form <noreply@postgresql.org> writes:
> Within a GitHub Actions workflow, a qemu chrooted environment is created
> from a RaspiOS lite image, within which the latest available postgresql
> is installed from apt (postgresql 11.11).
> Then tests of embedded software are executed, which include creating a
> postgresql database and performing a few benign operations (as far as
> PostgreSQL is concerned). Tests run perfectly fine in a desktop-like
> environment as well as on real devices.

> Within this qemu context, randomly yet quite frequently, postgresql
> PANICs.
> The latest log was the following:
> 2021-05-02 09:22:21.591 BST [15024] PANIC:  stuck spinlock detected at
> LWLockWaitListLock,
> /build/postgresql-11-rRyn74/postgresql-11-11.11/build/../src/backend/storage/lmgr/lwlock.c:832

Hm.  Looking at the lwlock.c source code, that's not actually a stuck
spinlock (in the sense of a loop around a TAS() call), but a loop
waiting for an LWLock's LW_FLAG_LOCKED bit to become clear.  It's
morally the same thing though, in that we don't expect the conflicting
lock to be held for more than a few instructions, so we just busy-wait
and delay until the lock can be obtained.
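
To make that concrete, here's roughly the shape of the logic, as a
paraphrase rather than the actual lwlock.c code (the flag value and
the delay bookkeeping are simplified):

#include <stdatomic.h>
#include <stdint.h>

#define LW_FLAG_LOCKED ((uint32_t) 1 << 28)     /* illustrative value */

void
wait_list_lock(atomic_uint *state)
{
    for (;;)
    {
        /* Try to grab the flag with one atomic read-modify-write. */
        uint32_t old = atomic_fetch_or(state, LW_FLAG_LOCKED);

        if (!(old & LW_FLAG_LOCKED))
            return;             /* bit was clear and we set it: acquired */

        /*
         * Someone else holds it: busy-wait until the bit clears.
         * atomic_load forces a fresh read of the variable on every
         * iteration; a compiler that hoisted a plain load out of this
         * loop (theory 1 below) would spin forever on a stale value.
         * The real code also sleeps between probes and PANICs with
         * "stuck spinlock detected" after too many of them.
         */
        while (atomic_load(state) & LW_FLAG_LOCKED)
            ;
    }
}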

Seems like there are a few possible explanations:

1. Compiler bug generating incorrect code for the wait loop (e.g.,
failing to re-fetch the volatile variable each time through).  The
difficulty with this theory is that then you'd expect to see the same
freezeup in normal non-qemu execution.  But maybe qemu slows things
down enough that the window for contention on an LWLock can be hit,
whereas you'd hardly ever see that without qemu.  Seems unlikely,
but maybe it'd be worth disassembling LWLockWaitListLock to check.

2. qemu bug in emulating the atomic-update instructions that are
used to set/clear LW_FLAG_LOCKED.  This doesn't seem real probable
either, but maybe it's the most likely of a bad lot.  (There's a
quick smoke test for this sketched just after this list.)

3. qemu is so slow that the spinlock delay times out.  I don't
believe this one either, mainly because we haven't seen it in
our own occasional uses of qemu; and if it were that slow it'd
be entirely unusable.  The spinlock timeout is normally multiple
seconds, which is several orders of magnitude longer than such
locks ought to be held.

4. Postgres bug causing the lock to never get released.  This theory
has the same problem as #1, ie you have to explain why it's not seen
in any other environment.

5. The lock does get released, but there are enough processes
contending for it that some process times out before it
successfully acquires the lock.  Perhaps that could happen under a
very high-load scenario, but that doesn't seem like the category of
test that would be sane to run under qemu.
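
If you want to poke at theory #2 directly, a standalone smoke test
along these lines (my sketch, not anything from your report) can be
compiled once and run both natively and under qemu-user; if the final
count ever comes up short, the emulated atomic read-modify-write is
broken:

/* atomic_smoke.c -- hammer an atomic counter from several processes.
 * On correct hardware or a correct emulator the total is exact. */
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 8                 /* arbitrary; raise to add contention */
#define NITER 1000000

int
main(void)
{
    /* Shared anonymous mapping, visible to all forked children. */
    atomic_uint *counter = mmap(NULL, sizeof(*counter),
                                PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (counter == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }
    atomic_init(counter, 0);

    for (int i = 0; i < NPROC; i++)
    {
        pid_t pid = fork();

        if (pid == 0)
        {
            for (int j = 0; j < NITER; j++)
                atomic_fetch_add(counter, 1);
            _exit(0);
        }
        if (pid < 0)
        {
            perror("fork");
            return 1;
        }
    }
    for (int i = 0; i < NPROC; i++)
        wait(NULL);

    unsigned expected = (unsigned) NPROC * NITER;
    unsigned got = atomic_load(counter);

    printf("expected %u, got %u: %s\n", expected, got,
           got == expected ? "ok" : "ATOMICS BROKEN");
    return got == expected ? 0 : 1;
}

Build it with something like "cc -O2 -std=gnu11 -o atomic_smoke
atomic_smoke.c" and run the binary in both environments; a clean run
under qemu doesn't prove much, but a short count would nail the
problem down immediately.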

Not sure what to tell you, other than "make sure qemu and your
build toolchain are up-to-date".

            regards, tom lane


