Re: "ERROR: latch already owned" on gharial - Mailing list pgsql-hackers
From: Soumyadeep Chakraborty
Subject: Re: "ERROR: latch already owned" on gharial
Msg-id: CAE-ML+-iXbkxeM6hO9gTU0wfkmqW=mMFKfmu4M9anG7Tz+f-Rg@mail.gmail.com
In response to: Re: "ERROR: latch already owned" on gharial (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hey,

Deeply appreciate both your input!

On Thu, Feb 8, 2024 at 4:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> Hmm, there is a pair of SpinLockAcquire() and SpinLockRelease() in
> ProcKill(), before step 3 can happen. Comment in spin.h about
> SpinLockAcquire/Release:
>
> > * Load and store operations in calling code are guaranteed not to be
> > * reordered with respect to these operations, because they include a
> > * compiler barrier. (Before PostgreSQL 9.5, callers needed to use a
> > * volatile qualifier to access data protected by spinlocks.)
>
> That talks about a compiler barrier, though, not a memory barrier. But
> looking at the implementations in s_lock.h, I believe they do act as
> memory barrier, too.
>
> So you might indeed have that problem on 9.4, but AFAICS not on later
> versions.

Yes, 9.4 does not have 0709b7ee72e, which I'm assuming you are referring to?

Reading src/backend/storage/lmgr/README.barrier: on x86, to avoid reordering between a load and a store, we need something that prevents both CPU and compiler reordering. pg_memory_barrier() fits the bill.

Here, pid X's read of latch->ownerPid == Y can be reordered to precede pid Y's store of latch->ownerPid = 0. The compiler barrier in S_UNLOCK() will prevent compiler reordering, but not CPU reordering, of the above:

#define S_UNLOCK(lock) \
    do { __asm__ __volatile__("" : : : "memory");  *(lock) = 0; } while (0)

which is equivalent to a:

#define pg_compiler_barrier_impl() __asm__ __volatile__("" ::: "memory")

But maybe both compiler and CPU reordering will be prevented by the tas() in S_LOCK(), which does a lock; xchgb? Is that acting as BOTH a compiler and a CPU barrier, like the lock; addl stuff in pg_memory_barrier_impl()? If yes, then the picture would look like this:

Pid Y in DisownLatch(), Pid X in OwnLatch():

Y: LOAD latch->ownerPid
...
Y: STORE latch->ownerPid = 0
...
// returning PGPROC to freeList
Y: S_LOCK(ProcStructLock)    <--- tas() prevents X: LOAD latch->ownerPid
...                               from preceding this
...                          <--- X: LOAD latch->ownerPid can't get here
...                               anyway, as the spinlock is held
Y: S_UNLOCK(ProcStructLock)
...
X: S_LOCK(ProcStructLock)    // to retrieve PGPROC from freeList
...
X: S_UNLOCK(ProcStructLock)
...
X: LOAD latch->ownerPid

In that case, this issue is not caused by 9.4 missing 0709b7ee72e, which changed S_UNLOCK() exclusively.

If no, then we would need the patch that does an explicit pg_memory_barrier() at the end of DisownLatch() for PG HEAD.

On Thu, Feb 8, 2024 at 1:41 PM Andres Freund <andres@anarazel.de> wrote:

> Right. I wonder if the issue instead could be something similar to what was
> fixed in 8fb13dd6ab5b and more generally in 97550c0711972a. If two procs go
> through proc_exit() for the same process, you can get all kinds of weird
> mixed up resource ownership. The bug fixed in 8fb13dd6ab5b wouldn't apply,
> but it's pretty easy to introduce similar bugs in other places, so it seems
> quite plausible that greenplum might have done so. We also did have more
> proc_exit()s in signal handlers in older branches, so it might just be an
> issue that also was present before.

Hmm, the pids X and Y in the example provided upthread don't spawn off any children (e.g. by calling system()) - they are just regular backends. So it's not possible for them to receive TERM and try to proc_exit() with the same PGPROC. So that is not the issue, I guess?

Regards,
Soumyadeep (VMware)