Re: "ERROR: latch already owned" on gharial - Mailing list pgsql-hackers

From Andres Freund
Subject Re: "ERROR: latch already owned" on gharial
Date
Msg-id 20240208214114.cpkib3tnfypjcjau@awork3.anarazel.de
In response to Re: "ERROR: latch already owned" on gharial  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: "ERROR: latch already owned" on gharial
List pgsql-hackers
Hi,

On 2024-02-08 14:57:47 +0200, Heikki Linnakangas wrote:
> On 08/02/2024 04:08, Soumyadeep Chakraborty wrote:
> > A possible ordering of events:
> > 
> > (1) DisownLatch() is called by pid Y during ProcKill() and the write for
> > latch->owner_pid = 0 is NOT yet flushed to shmem.
> > 
> > (2) The PGPROC object for pid Y is returned to the free list.
> > 
> > (3) Pid X sees the same PGPROC object on the free list and grabs it.
> > 
> > (4) Pid X does sanity check inside OwnLatch during InitProcess and still
> > sees the old value of latch->owner_pid = Y (and not = 0), and trips the ERROR.
> > 
> > The above sequence of operations should apply to PG HEAD as well.
> > 
> > Suggestion:
> > 
> > Should we do a pg_memory_barrier() at the end of DisownLatch(), like in
> > ResetLatch(), like the one introduced in [3]? This would ensure that the write
> > latch->owner_pid = 0; is flushed to shmem. The attached patch does this.
> 
> Hmm, there is a pair of SpinLockAcquire() and SpinLockRelease() in
> ProcKill(), before step 3 can happen.

Right.  I wonder if the issue instead could be something similar to what was
fixed in 8fb13dd6ab5b and more generally in 97550c0711972a. If two procs go
through proc_exit() for the same process, you can get all kinds of weird
mixed-up resource ownership.  The bug fixed in 8fb13dd6ab5b wouldn't apply,
but it's pretty easy to introduce similar bugs in other places, so it seems
quite plausible that Greenplum might have done so.  We also did have more
proc_exit()s in signal handlers in older branches, so it might just be an
issue that also was present before.
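
To make that failure mode concrete, here is a toy, single-threaded model of how
running the exit-time cleanup twice for one backend could end in "latch already
owned". It is only a sketch under assumptions, not PostgreSQL code and not a
claim about where the actual bug is: ToyProc, the array-based free list, and
all toy_* functions are hypothetical stand-ins for PGPROC, the proc free list,
OwnLatch(), DisownLatch() and ProcKill().

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SLOTS 4

/* Hypothetical stand-in for a PGPROC slot with its embedded latch. */
typedef struct ToyProc
{
    int owner_pid;              /* 0 means the latch is unowned */
} ToyProc;

static ToyProc slots[MAX_SLOTS];
static int  free_list[MAX_SLOTS * 2];   /* room for erroneous duplicates */
static int  free_count = 0;

static void
free_list_push(int slot)
{
    assert(free_count < MAX_SLOTS * 2);
    free_list[free_count++] = slot;
}

static int
free_list_pop(void)
{
    assert(free_count > 0);
    return free_list[--free_count];
}

/* Models the sanity check; returns false where the real code would ERROR. */
static bool
toy_own_latch(ToyProc *proc, int pid)
{
    if (proc->owner_pid != 0)
        return false;           /* "latch already owned" */
    proc->owner_pid = pid;
    return true;
}

static void
toy_disown_latch(ToyProc *proc)
{
    proc->owner_pid = 0;
}

/* Models exit-time cleanup: release the latch, return the slot. */
static void
toy_proc_kill(int slot)
{
    toy_disown_latch(&slots[slot]);
    free_list_push(slot);
}

/*
 * One backend starts and exits, running its cleanup cleanup_runs times.
 * New backends then start until the free list is empty.  Returns true if
 * any of them hits the "latch already owned" check.
 */
static bool
simulate_exit(int cleanup_runs)
{
    bool error_hit = false;
    int  next_pid = 200;
    int  slot;
    int  i;

    for (i = 0; i < MAX_SLOTS; i++)
        slots[i].owner_pid = 0;
    free_count = 0;
    free_list_push(0);

    slot = free_list_pop();
    toy_own_latch(&slots[slot], 100);

    /* A second run models cleanup re-entered, e.g. from a signal handler. */
    for (i = 0; i < cleanup_runs; i++)
        toy_proc_kill(slot);

    while (free_count > 0)
    {
        int s = free_list_pop();

        if (!toy_own_latch(&slots[s], next_pid++))
            error_hit = true;
    }
    return error_hit;
}
```

With a single cleanup run, simulate_exit(1) returns false. With two runs,
simulate_exit(2) returns true: the slot is on the free list twice, so two new
backends are handed the same slot and the second one fails the ownership
check. No memory-ordering problem is needed to get there.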

Greetings,

Andres Freund


