Re: "ERROR: latch already owned" on gharial - Mailing list pgsql-hackers
| From | Soumyadeep Chakraborty |
|---|---|
| Subject | Re: "ERROR: latch already owned" on gharial |
| Date | |
| Msg-id | CAE-ML+-iXbkxeM6hO9gTU0wfkmqW=mMFKfmu4M9anG7Tz+f-Rg@mail.gmail.com Whole thread |
| In response to | Re: "ERROR: latch already owned" on gharial (Andres Freund <andres@anarazel.de>) |
| List | pgsql-hackers |
Hey,
Deeply appreciate both your input!
On Thu, Feb 8, 2024 at 4:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm, there is a pair of SpinLockAcquire() and SpinLockRelease() in
> ProcKill(), before step 3 can happen. Comment in spin.h about
> SpinLockAcquire/Release:
>
> > * Load and store operations in calling code are guaranteed not to be
> > * reordered with respect to these operations, because they include a
> > * compiler barrier. (Before PostgreSQL 9.5, callers needed to use a
> > * volatile qualifier to access data protected by spinlocks.)
>
> That talks about a compiler barrier, though, not a memory barrier. But
> looking at the implementations in s_lock.h, I believe they do act as
> memory barrier, too.
>
> So you might indeed have that problem on 9.4, but AFAICS not on later
> versions.
Yes 9.4 does not have 0709b7ee72e, which I'm assuming you are referring to?
Reading src/backend/storage/lmgr/README.barrier: For x86, to avoid reordering
between a load and a store, we need something that prevents both CPU and
compiler reordering. pg_memory_barrier() fits the bill.
Here we can have pid X's read of latch->owner_pid=Y reordered to precede
pid Y's store of latch->owner_pid = 0. The compiler barrier in S_UNLOCK() will
prevent compiler reordering but not CPU reordering of the above.
#define S_UNLOCK(lock) \
do { __asm__ __volatile__("" : : : "memory"); *(lock) = 0; } while (0)
which is equivalent to a:
#define pg_compiler_barrier_impl() __asm__ __volatile__("" ::: "memory")
But maybe both CPU and memory reordering will be prevented by the tas() in
S_LOCK() which does a lock and xchgb?
Is the above acting as BOTH a compiler and CPU barrier, like the lock; addl
stuff in pg_memory_barrier_impl()?
If yes, then the picture would look like this:
Pid Y in DisownLatch(), Pid X in OwnLatch()
Y: LOAD latch->ownerPid
...
Y: STORE latch->ownerPid = 0
...
// returning PGPROC to freeList
Y:S_LOCK(ProcStructLock) <--- tas() prevents X: LOAD latch->ownerPid
from preceding this
...
... <-------- X: LOAD latch->ownerPid can't get here anyway as spinlock is held
...
Y:S_UNLOCK(ProcStructLock)
...
X: S_LOCK(ProcStructLock) // to retrieve PGPROC from freeList
...
X: S_UNLOCK(ProcStructLock)
...
X: LOAD latch->ownerPid
And this issue is not caused due to 9.4 missing 0709b7ee72e, which
changed S_UNLOCK
exclusively.
If no, then we would need the patch that does an explicit pg_memory_barrier()
at the end of DisownLatch() for PG HEAD.
On Thu, Feb 8, 2024 at 1:41 PM Andres Freund <andres@anarazel.de> wrote:
> Right. I wonder if the issue istead could be something similar to what was
> fixed in 8fb13dd6ab5b and more generally in 97550c0711972a. If two procs go
> through proc_exit() for the same process, you can get all kinds of weird
> mixed up resource ownership. The bug fixed in 8fb13dd6ab5b wouldn't apply,
> but it's pretty easy to introduce similar bugs in other places, so it seems
> quite plausible that greenplum might have done so. We also did have more
> proc_exit()s in signal handlers in older branches, so it might just be an
> issue that also was present before.
Hmm, the pids X and Y in the example provided upthread don't spawn off any
children (like by calling system()) - they are just regular backends. So its
not possible for them to receive TERM and try to proc_exit() w/ the same
PGPROC. So that is not the issue, I guess?
Regards,
Soumyadeep (VMware)
pgsql-hackers by date: