Re: "ERROR: latch already owned" on gharial - Mailing list pgsql-hackers

From Soumyadeep Chakraborty
Subject Re: "ERROR: latch already owned" on gharial
Msg-id CAE-ML+-iXbkxeM6hO9gTU0wfkmqW=mMFKfmu4M9anG7Tz+f-Rg@mail.gmail.com
In response to Re: "ERROR: latch already owned" on gharial  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hey,

Deeply appreciate input from both of you!

On Thu, Feb 8, 2024 at 4:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm, there is a pair of SpinLockAcquire() and SpinLockRelease() in
> ProcKill(), before step 3 can happen. Comment in spin.h about
> SpinLockAcquire/Release:
>
> >  *    Load and store operations in calling code are guaranteed not to be
> >  *    reordered with respect to these operations, because they include a
> >  *    compiler barrier.  (Before PostgreSQL 9.5, callers needed to use a
> >  *    volatile qualifier to access data protected by spinlocks.)
>
> That talks about a compiler barrier, though, not a memory barrier. But
> looking at the implementations in s_lock.h, I believe they do act as
> memory barrier, too.
>
> So you might indeed have that problem on 9.4, but AFAICS not on later
> versions.

Yes, 9.4 does not have 0709b7ee72e, which is what I'm assuming you are
referring to?

Reading src/backend/storage/lmgr/README.barrier: For x86, to avoid reordering
between a load and a store, we need something that prevents both CPU and
compiler reordering. pg_memory_barrier() fits the bill.
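
For reference, the x86_64 implementation of that barrier (in
src/include/port/atomics/arch-x86.h) is roughly:

#define pg_memory_barrier_impl() \
    __asm__ __volatile__ ("lock; addl $0,0(%%rsp)" ::: "memory")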

Here, pid X's read of latch->owner_pid (observing the stale value Y) can be
reordered to precede pid Y's store of latch->owner_pid = 0. The compiler
barrier in S_UNLOCK() will prevent compiler reordering of the above, but not
CPU reordering.

#define S_UNLOCK(lock) \
do { __asm__ __volatile__("" : : : "memory");  *(lock) = 0; } while (0)
where the inline asm is equivalent to:
#define pg_compiler_barrier_impl() __asm__ __volatile__("" ::: "memory")

But maybe both compiler and CPU reordering will be prevented by the tas() in
S_LOCK(), which does a lock; xchgb?
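
For context, the x86_64 tas() in src/include/storage/s_lock.h looks roughly
like this:

static __inline__ int
tas(volatile slock_t *lock)
{
    slock_t _res = 1;

    __asm__ __volatile__(
        "   lock            \n"
        "   xchgb   %0,%1   \n"
:       "+q"(_res), "+m"(*lock)
:       /* no inputs */
:       "memory", "cc");
    return (int) _res;
}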

Is the above acting as BOTH a compiler and CPU barrier, like the lock; addl
stuff in pg_memory_barrier_impl()?

If yes, then the picture would look like this:

Pid Y in DisownLatch(), Pid X in OwnLatch():

Y: LOAD latch->owner_pid
...
Y: STORE latch->owner_pid = 0
...
// returning PGPROC to freeList
Y: S_LOCK(ProcStructLock) <--- tas() prevents X's LOAD of latch->owner_pid
                               from preceding this
...
... <--- X's LOAD of latch->owner_pid can't land here anyway, as the
         spinlock is held
...
Y: S_UNLOCK(ProcStructLock)
...
X: S_LOCK(ProcStructLock) // to retrieve PGPROC from freeList
...
X: S_UNLOCK(ProcStructLock)
...
X: LOAD latch->owner_pid

And then this issue would not be caused by 9.4 missing 0709b7ee72e, which
exclusively changed S_UNLOCK.

If no, then we would need the patch that does an explicit pg_memory_barrier()
at the end of DisownLatch() for PG HEAD.
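
Something along these lines (just a sketch to illustrate; everything except
the final barrier is DisownLatch() as it exists today):

void
DisownLatch(Latch *latch)
{
    Assert(latch->is_shared);
    Assert(latch->owner_pid == MyProcPid);

    latch->owner_pid = 0;

    /*
     * Full memory barrier: prevent the store above from being reordered
     * past the point where another backend can grab this PGPROC (and its
     * latch) off the freeList and call OwnLatch() on it.
     */
    pg_memory_barrier();
}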

On Thu, Feb 8, 2024 at 1:41 PM Andres Freund <andres@anarazel.de> wrote:

> Right.  I wonder if the issue instead could be something similar to what was
> fixed in 8fb13dd6ab5b and more generally in 97550c0711972a. If two procs go
> through proc_exit() for the same process, you can get all kinds of weird
> mixed up resource ownership.  The bug fixed in 8fb13dd6ab5b wouldn't apply,
> but it's pretty easy to introduce similar bugs in other places, so it seems
> quite plausible that greenplum might have done so.  We also did have more
> proc_exit()s in signal handlers in older branches, so it might just be an
> issue that also was present before.

Hmm, the pids X and Y in the example provided upthread don't spawn off any
children (like by calling system()) - they are just regular backends. So it's
not possible for them to receive TERM and try to proc_exit() with the same
PGPROC. So that is not the issue, I guess?

Regards,
Soumyadeep (VMware)


