Thread: "ERROR: latch already owned" on gharial

"ERROR: latch already owned" on gharial

From: Thomas Munro

Hi,

A couple of recent isolation test failures reported $SUBJECT.

It could be a bug in recent-ish latch refactoring work, though I don't
know why it would show up twice just recently.

Just BTW, that animal has shown signs of a flaky toolchain before[1].
I know we have quite a lot of museum exhibits in the 'farm, in terms
of hardware, OS, and toolchain.  In some cases, they're probably just
forgotten/not on anyone's upgrade radar.  If they've shown signs of
misbehaving, maybe it's time to figure out if they can be upgraded?
For example, it'd be nice to be able to rule out problems in GCC 4.6.0
(that's like running PostgreSQL 9.1.0, in terms of vintage,
unsupported status, and a long list of missing bugfixes from the time
when it was supported).

[1] https://www.postgresql.org/message-id/CA+hUKGJK5R0S1LL_W4vEzKxNQGY_xGAQ1XknR-WN9jqQeQtB_w@mail.gmail.com



Re: "ERROR: latch already owned" on gharial

From: Andres Freund

Hi,

On 2022-05-25 12:45:21 +1200, Thomas Munro wrote:
> A couple of recent isolation test failures reported $SUBJECT.

Was that just on gharial?


> It could be a bug in recent-ish latch refactoring work, though I don't
> know why it would show up twice just recently.

Yea, that's weird.


> Just BTW, that animal has shown signs of a flaky toolchain before[1].
> I know we have quite a lot of museum exhibits in the 'farm, in terms
> of hardware, OS, and toolchain.  In some cases, they're probably just
> forgotten/not on anyone's upgrade radar.  If they've shown signs of
> misbehaving, maybe it's time to figure out if they can be upgraded?
> For example, it'd be nice to be able to rule out problems in GCC 4.6.0
> (that's like running PostgreSQL 9.1.0, in terms of vintage,
> unsupported status, and a long list of missing bugfixes from the time
> when it was supported).

Yea. gcc 4.6.0 is pretty ridiculous - the only thing we gain by testing with a
.0 compiler of that vintage is pain. Could it be upgraded?


TBH, I think we should just desupport HPUX. It's makework to support it at
this point. 11.31 v3 is about to be old enough to drink in quite a few
countries...

Greetings,

Andres Freund



Re: "ERROR: latch already owned" on gharial

From: Tom Lane

Andres Freund <andres@anarazel.de> writes:
> On 2022-05-25 12:45:21 +1200, Thomas Munro wrote:
>> I know we have quite a lot of museum exhibits in the 'farm, in terms
>> of hardware, OS, and toolchain.  In some cases, they're probably just
>> forgotten/not on anyone's upgrade radar.  If they've shown signs of
>> misbehaving, maybe it's time to figure out if they can be upgraded?

> TBH, I think we should just desupport HPUX.

I think there's going to be a significant die-off of old BF animals
when (if?) we convert over to the meson build system; it's just not
going to be worth the trouble to upgrade those platforms to be able
to run meson and ninja.  I'm inclined to wait until that's over and
see what's still standing before we make decisions about officially
desupporting things.

            regards, tom lane



Re: "ERROR: latch already owned" on gharial

From: Noah Misch

On Tue, May 24, 2022 at 06:24:39PM -0700, Andres Freund wrote:
> On 2022-05-25 12:45:21 +1200, Thomas Munro wrote:
> > Just BTW, that animal has shown signs of a flaky toolchain before[1].
> > I know we have quite a lot of museum exhibits in the 'farm, in terms
> > of hardware, OS, and toolchain.  In some cases, they're probably just
> > forgotten/not on anyone's upgrade radar.  If they've shown signs of
> > misbehaving, maybe it's time to figure out if they can be upgraded?
> > For example, it'd be nice to be able to rule out problems in GCC 4.6.0
> > (that's like running PostgreSQL 9.1.0, in terms of vintage,
> > unsupported status, and a long list of missing bugfixes from the time
> > when it was supported).
> 
> Yea. gcc 4.6.0 is pretty ridiculous - the only thing we gain by testing with a
> .0 compiler of that vintage is pain. Could it be upgraded?

+1, this is at least the third non-obvious miscompilation from gharial.
Installing the latest GCC that builds easily (perhaps GCC 10.3) would make
this a good buildfarm member again.  If that won't happen, at least add a note
to the animal as described in
https://postgr.es/m/20211109144021.GD940092@rfd.leadboat.com



Re: "ERROR: latch already owned" on gharial

From: Tom Lane

Noah Misch <noah@leadboat.com> writes:
> +1, this is at least the third non-obvious miscompilation from gharial.

Is there any evidence that this is a compiler-sourced problem?
Maybe it is, but it's sure not obvious to me (he says, eyeing his
buildfarm animals with even older gcc versions).

            regards, tom lane



Re: "ERROR: latch already owned" on gharial

From: Thomas Munro

On Thu, May 26, 2022 at 2:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Noah Misch <noah@leadboat.com> writes:
> > +1, this is at least the third non-obvious miscompilation from gharial.
>
> Is there any evidence that this is a compiler-sourced problem?
> Maybe it is, but it's sure not obvious to me (he says, eyeing his
> buildfarm animals with even older gcc versions).

Sorry for the ambiguity -- I have no evidence of miscompilation.  My
"just BTW" paragraph was a reaction to the memory of the last couple
of times Noah and I wasted hours chasing red herrings on this system,
which is pretty demotivating when looking into an unexplained failure.

On a more practical note, I don't have access to the BF database right
now.  Would you mind checking if "latch already owned" has occurred on
any other animals?



Re: "ERROR: latch already owned" on gharial

From: Tom Lane

Thomas Munro <thomas.munro@gmail.com> writes:
> Sorry for the ambiguity -- I have no evidence of miscompilation.  My
> "just BTW" paragraph was a reaction to the memory of the last couple
> of times Noah and I wasted hours chasing red herrings on this system,
> which is pretty demotivating when looking into an unexplained failure.

I can't deny that those HPUX animals have produced more than their
fair share of problems.

> On a more practical note, I don't have access to the BF database right
> now.  Would you mind checking if "latch already owned" has occurred on
> any other animals?

Looking back 6 months, these are the only occurrences of that string
in failed tests:

 sysname | branch |      snapshot       |     stage      |                                 l
---------+--------+---------------------+----------------+--------------------------------------------------------------------
 gharial | HEAD   | 2022-04-28 23:37:51 | Check          | 2022-04-28 18:36:26.981 MDT [22642:1] ERROR:  latch already owned
 gharial | HEAD   | 2022-05-06 11:33:11 | IsolationCheck | 2022-05-06 10:10:52.727 MDT [7366:1] ERROR:  latch already owned
 gharial | HEAD   | 2022-05-24 06:31:31 | IsolationCheck | 2022-05-24 02:44:51.850 MDT [13089:1] ERROR:  latch already owned
(3 rows)


            regards, tom lane



Re: "ERROR: latch already owned" on gharial

From: Thomas Munro

On Thu, May 26, 2022 at 2:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On a more practical note, I don't have access to the BF database right
> > now.  Would you mind checking if "latch already owned" has occurred on
> > any other animals?
>
> Looking back 6 months, these are the only occurrences of that string
> in failed tests:
>
>  sysname | branch |      snapshot       |     stage      |                                 l
> ---------+--------+---------------------+----------------+--------------------------------------------------------------------
>  gharial | HEAD   | 2022-04-28 23:37:51 | Check          | 2022-04-28 18:36:26.981 MDT [22642:1] ERROR:  latch already owned
>  gharial | HEAD   | 2022-05-06 11:33:11 | IsolationCheck | 2022-05-06 10:10:52.727 MDT [7366:1] ERROR:  latch already owned
>  gharial | HEAD   | 2022-05-24 06:31:31 | IsolationCheck | 2022-05-24 02:44:51.850 MDT [13089:1] ERROR:  latch already owned
> (3 rows)

Thanks.  Hmm.  So far it's always a parallel worker.  The best idea I
have is to include the mystery PID in the error message and see if that
provides a clue next time.

Attachment

Re: "ERROR: latch already owned" on gharial

From: Robert Haas

On Fri, May 27, 2022 at 7:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Thanks.  Hmm.  So far it's always a parallel worker.  The best idea I
> have is to include the mystery PID in the error message and see if that
> provides a clue next time.

What I'm inclined to do is get gharial and anole removed from the
buildfarm. anole was set up by Heikki in 2011. I don't know when
gharial was set up, or by whom. I don't think anyone at EDB cares
about these machines any more, or has any interest in maintaining
them. I think the only reason they're still running is that, just by
good fortune, they haven't fallen over and died yet. The hardest part
of getting them taken out of the buildfarm is likely to be finding
someone who has a working username and password to log into them and
take the jobs out of the crontab.

If someone really cares about figuring out what's going on here, it's
probably possible to get someone who is an EDB employee access to the
box to chase it down. But I'm having a hard time understanding what
value we get out of that given that the machines are running an
11-year-old compiler version on discontinued hardware on a
discontinued operating system. Even if we find a bug in PostgreSQL,
it's likely to be a bug that only matters on systems nobody cares
about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: "ERROR: latch already owned" on gharial

From: Tom Lane

Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, May 27, 2022 at 7:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Thanks.  Hmm.  So far it's always a parallel worker.  The best idea I
>> have is to include the mystery PID in the error message and see if that
>> provides a clue next time.

> ... Even if we find a bug in PostgreSQL,
> it's likely to be a bug that only matters on systems nobody cares
> about.

That's possible, certainly.  It's also possible that it's a real bug
that so far has only manifested there for (say) timing reasons.
The buildfarm is not so large that we can write off single-machine
failures as being unlikely to hit in the real world.

What I'd suggest is to promote that failure to elog(PANIC), which
would at least give us the PID and if we're lucky a stack trace.

            regards, tom lane



Re: "ERROR: latch already owned" on gharial

From: Robert Haas

On Fri, May 27, 2022 at 10:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That's possible, certainly.  It's also possible that it's a real bug
> that so far has only manifested there for (say) timing reasons.
> The buildfarm is not so large that we can write off single-machine
> failures as being unlikely to hit in the real world.
>
> What I'd suggest is to promote that failure to elog(PANIC), which
> would at least give us the PID and if we're lucky a stack trace.

That proposed change is fine with me.

As to the question of whether it's a real bug, nobody can prove
anything unless we actually run it down. It's just a question of what
you think the odds are. Noah's PGCon talk a few years back on the long
tail of buildfarm failures convinced me (perhaps unintentionally) that
low-probability failures that occur only on obscure systems or
configurations are likely not worth running down, because while they
COULD be real bugs, a lot of them aren't, and the time it would take
to figure it out could be spent on other things - for instance, fixing
things that we know for certain are bugs. Spending 40 hours of
person-time on something with a 10% chance of being a bug in the
PostgreSQL code doesn't necessarily make sense to me, because while
you are correct that the buildfarm isn't that large, neither is the
developer community.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: "ERROR: latch already owned" on gharial

From: Tom Lane

Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, May 27, 2022 at 10:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What I'd suggest is to promote that failure to elog(PANIC), which
>> would at least give us the PID and if we're lucky a stack trace.

> That proposed change is fine with me.

> As to the question of whether it's a real bug, nobody can prove
> anything unless we actually run it down.

Agreed, and I'll even grant your point that if it is an HPUX-specific
or IA64-specific bug, it is not worth spending huge amounts of time
to isolate.  The problem is that we don't know that.  What we do know
so far is that if it can occur elsewhere, it's rare --- so we'd better
be prepared to glean as much info as possible if we do get such a
failure.  Hence my thought of s/ERROR/PANIC/.  And I'd be in favor of
any other low-effort change we can make to instrument the case better.

            regards, tom lane



Re: "ERROR: latch already owned" on gharial

From: Thomas Munro

On Sat, May 28, 2022 at 8:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, May 27, 2022 at 10:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> What I'd suggest is to promote that failure to elog(PANIC), which
> >> would at least give us the PID and if we're lucky a stack trace.
>
> > That proposed change is fine with me.
>
> > As to the question of whether it's a real bug, nobody can prove
> > anything unless we actually run it down.
>
> Agreed, and I'll even grant your point that if it is an HPUX-specific
> or IA64-specific bug, it is not worth spending huge amounts of time
> to isolate.  The problem is that we don't know that.  What we do know
> so far is that if it can occur elsewhere, it's rare --- so we'd better
> be prepared to glean as much info as possible if we do get such a
> failure.  Hence my thought of s/ERROR/PANIC/.  And I'd be in favor of
> any other low-effort change we can make to instrument the case better.

OK, pushed (except I realised that all the PIDs involved were int, not
pid_t).  Let's see...



Re: "ERROR: latch already owned" on gharial

From: Thomas Munro

On Sat, May 28, 2022 at 1:56 AM Robert Haas <robertmhaas@gmail.com> wrote:
> What I'm inclined to do is get gharial and anole removed from the
> buildfarm. anole was set up by Heikki in 2011. I don't know when
> gharial was set up, or by whom. I don't think anyone at EDB cares
> about these machines any more, or has any interest in maintaining
> them. I think the only reason they're still running is that, just by
> good fortune, they haven't fallen over and died yet. The hardest part
> of getting them taken out of the buildfarm is likely to be finding
> someone who has a working username and password to log into them and
> take the jobs out of the crontab.

FWIW, in a previous investigation, Semab and Sandeep had access:

https://www.postgresql.org/message-id/CABimMB4mRs9N3eivR-%3DqF9M8oWc5E6OX7GywsWF0DXN4P5gNEA%40mail.gmail.com



Re: "ERROR: latch already owned" on gharial

From: Robert Haas

On Mon, May 30, 2022 at 8:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, May 28, 2022 at 1:56 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > What I'm inclined to do is get gharial and anole removed from the
> > buildfarm. anole was set up by Heikki in 2011. I don't know when
> > gharial was set up, or by whom. I don't think anyone at EDB cares
> > about these machines any more, or has any interest in maintaining
> > them. I think the only reason they're still running is that, just by
> > good fortune, they haven't fallen over and died yet. The hardest part
> > of getting them taken out of the buildfarm is likely to be finding
> > someone who has a working username and password to log into them and
> > take the jobs out of the crontab.
>
> FWIW, in a previous investigation, Semab and Sandeep had access:
>
> https://www.postgresql.org/message-id/CABimMB4mRs9N3eivR-%3DqF9M8oWc5E6OX7GywsWF0DXN4P5gNEA%40mail.gmail.com

Yeah, I'm in touch with Sandeep but not able to get in yet for some
reason. Will try to sort it out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: "ERROR: latch already owned" on gharial

From: Robert Haas

On Tue, May 31, 2022 at 8:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 30, 2022 at 8:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Sat, May 28, 2022 at 1:56 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > What I'm inclined to do is get gharial and anole removed from the
> > > buildfarm. anole was set up by Heikki in 2011. I don't know when
> > > gharial was set up, or by whom. I don't think anyone at EDB cares
> > > about these machines any more, or has any interest in maintaining
> > > them. I think the only reason they're still running is that, just by
> > > good fortune, they haven't fallen over and died yet. The hardest part
> > > of getting them taken out of the buildfarm is likely to be finding
> > > someone who has a working username and password to log into them and
> > > take the jobs out of the crontab.
> >
> > FWIW, in a previous investigation, Semab and Sandeep had access:
> >
> > https://www.postgresql.org/message-id/CABimMB4mRs9N3eivR-%3DqF9M8oWc5E6OX7GywsWF0DXN4P5gNEA%40mail.gmail.com
>
> Yeah, I'm in touch with Sandeep but not able to get in yet for some
> reason. Will try to sort it out.

OK, I have access to the box now. I guess I might as well leave the
crontab jobs enabled until the next time this happens, since Thomas
just took steps to improve the logging, but I do think these BF
members are overdue to be killed off, and would like to do that as
soon as it seems like a reasonable step to take.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: "ERROR: latch already owned" on gharial

From: Thomas Munro

On Wed, Jun 1, 2022 at 12:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
> OK, I have access to the box now. I guess I might as well leave the
> crontab jobs enabled until the next time this happens, since Thomas
> just took steps to improve the logging, but I do think these BF
> members are overdue to be killed off, and would like to do that as
> soon as it seems like a reasonable step to take.

A couple of months later, there has been no repeat of that error.  I'd
happily forget about that and move on, if you want to decommission
these.



Re: "ERROR: latch already owned" on gharial

From: Robert Haas

On Sun, Jul 3, 2022 at 11:51 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Jun 1, 2022 at 12:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > OK, I have access to the box now. I guess I might as well leave the
> > crontab jobs enabled until the next time this happens, since Thomas
> > just took steps to improve the logging, but I do think these BF
> > members are overdue to be killed off, and would like to do that as
> > soon as it seems like a reasonable step to take.
>
> A couple of months later, there has been no repeat of that error.  I'd
> happily forget about that and move on, if you want to decommission
> these.

I have commented out the BF stuff in crontab on that machine.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: "ERROR: latch already owned" on gharial

From: Sandeep Thakkar

Thanks, Robert.

We are receiving alerts from buildfarm-admins about anole and gharial not reporting. Who can help stop these? Thanks

On Wed, Jul 6, 2022 at 1:27 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jul 3, 2022 at 11:51 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, Jun 1, 2022 at 12:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > > OK, I have access to the box now. I guess I might as well leave the
> > > crontab jobs enabled until the next time this happens, since Thomas
> > > just took steps to improve the logging, but I do think these BF
> > > members are overdue to be killed off, and would like to do that as
> > > soon as it seems like a reasonable step to take.
> >
> > A couple of months later, there has been no repeat of that error.  I'd
> > happily forget about that and move on, if you want to decommission
> > these.
>
> I have commented out the BF stuff in crontab on that machine.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com




--
Sandeep Thakkar


Re: "ERROR: latch already owned" on gharial

From: Alvaro Herrera

On 2022-Jul-13, Sandeep Thakkar wrote:

> Thanks Robert.
> 
> We are receiving the alerts from buildfarm-admins for anole and gharial not
> reporting. Who can help to stop these? Thanks

Probably Andrew knows how to set buildsystems.no_alerts for these
animals.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)



Re: "ERROR: latch already owned" on gharial

From: Soumyadeep Chakraborty

Hey hackers,

I wanted to report that we have seen this issue (with the procLatch) a few
times very sporadically on Greenplum 6X (based on 9.4), with relatively newer
versions of GCC.

I realize that 9.4 is out of support, so this email is purely to add on to the
existing thread, in case the info can help fix/reveal something in supported
versions.

Unfortunately, we don't have a core to share as we don't have the benefit of
commit [1] in Greenplum 6X, but we do possess commit [2] which gives us an elog
ERROR as opposed to PANIC.

Instance 1:

Event 1: 2023-11-13 10:01:31.927168 CET..., pY, ..."LOG","00000","disconnection: session time: ..."
Event 2: 2023-11-13 10:01:32.049135 CET...,pX,,,,,"FATAL","XX000","latch already owned by pid Y (is_set: 0) (pg_latch.c:159)",,,,,,,0,,"pg_latch.c",159,"Stack trace:
1    0xbde8b8 postgres errstart (elog.c:567)
2    0xbe0768 postgres elog_finish (discriminator 7)
3    0xa08924 postgres <symbol not found> (pg_latch.c:158) <---------- OwnLatch
4    0xa7f179 postgres InitProcess (proc.c:523)
5    0xa94ac3 postgres PostgresMain (postgres.c:4874)
6    0xa1e2ed postgres <symbol not found> (postmaster.c:2860)
7    0xa1f295 postgres PostmasterMain (discriminator 5)
...
"LOG","00000","server process (PID Y) exited with exit code 1",,,,,,,0,,"postmaster.c",3987,

Instance 2 (was reported with (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)):

Exactly the same as Instance 1, with an identical log, ordering of events, and
stack trace, except that this time the log shows (is_set: 1) when the ERROR is logged.

A possible ordering of events:

(1) DisownLatch() is called by pid Y during ProcKill() and the write for
latch->owner_pid = 0 is NOT yet flushed to shmem.

(2) The PGPROC object for pid Y is returned to the free list.

(3) Pid X sees the same PGPROC object on the free list and grabs it.

(4) Pid X does the sanity check inside OwnLatch during InitProcess, still
sees the old value of latch->owner_pid = Y (and not = 0), and trips the ERROR.

The above sequence of operations should apply to PG HEAD as well.
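
Put as a timeline, the scenario described above (assuming no barrier after the
owner_pid store) would be:

Y (exiting backend)                       X (new backend)
-------------------                       ---------------
DisownLatch: STORE owner_pid = 0
  (store not yet visible to other CPUs)
return PGPROC to free list
                                          grab the same PGPROC from the free list
                                          OwnLatch: LOAD owner_pid  -> still sees Y
                                          elog(ERROR, "latch already owned ...")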

Suggestion:

Should we do a pg_memory_barrier() at the end of DisownLatch(), like the one
introduced in ResetLatch() by [3]? This would ensure that the write
latch->owner_pid = 0; is flushed to shmem. The attached patch does this.
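
For anyone reading this without the attachment, the patch amounts to roughly
the following (a sketch against the current shape of DisownLatch() in
src/backend/storage/ipc/latch.c, not the literal diff):

void
DisownLatch(Latch *latch)
{
	Assert(latch->is_shared);
	Assert(latch->owner_pid == MyProcPid);

	latch->owner_pid = 0;

	/*
	 * Proposed: make the store above visible before this latch's PGPROC
	 * can be handed to another backend, mirroring the barrier that
	 * ResetLatch() already has.
	 */
	pg_memory_barrier();
}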

I'm not sure why we didn't introduce a memory barrier in DisownLatch() in [3].
I didn't find anything in the associated hackers thread [4] either. Was it the
performance impact, or was it just because SetLatch and ResetLatch were more
racy and this is way less likely to happen?

This is out of my wheelhouse, but would one additional barrier in a process'
lifecycle be that bad for performance?

Appendix:

Build details: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)

CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -fno-aggressive-loop-optimizations
-Wno-unused-but-set-variable -Wno-address -Werror=implicit-fallthrough=3
-Wno-format-truncation -Wno-stringop-truncation -m64 -O3
-fargument-noalias-global -fno-omit-frame-pointer -g -std=gnu99
-Werror=uninitialized -Werror=implicit-function-declaration

Regards,
Soumyadeep (VMware)

[1] https://github.com/postgres/postgres/commit/12e28aac8e8eb76cab13a4e9b696e3dab17f1c99
[2] https://github.com/greenplum-db/gpdb/commit/81fdd6c5219af865e9dc41f4087e0405d6616050
[3] https://github.com/postgres/postgres/commit/14e8803f101a54d99600683543b0f893a2e3f529
[4] https://www.postgresql.org/message-id/flat/20150112154026.GB2092%40awork2.anarazel.de

Attachment

Re: "ERROR: latch already owned" on gharial

From: Heikki Linnakangas

On 08/02/2024 04:08, Soumyadeep Chakraborty wrote:
> A possible ordering of events:
> 
> (1) DisownLatch() is called by pid Y during ProcKill() and the write for
> latch->owner_pid = 0 is NOT yet flushed to shmem.
> 
> (2) The PGPROC object for pid Y is returned to the free list.
> 
> (3) Pid X sees the same PGPROC object on the free list and grabs it.
> 
> (4) Pid X does sanity check inside OwnLatch during InitProcess and
> still sees the
> old value of latch->owner_pid = Y (and not = 0), and trips the ERROR.
> 
> The above sequence of operations should apply to PG HEAD as well.
> 
> Suggestion:
> 
> Should we do a pg_memory_barrier() at the end of DisownLatch(), like in
> ResetLatch(), like the one introduced in [3]? This would ensure that the write
> latch->owner_pid = 0; is flushed to shmem. The attached patch does this.

Hmm, there is a pair of SpinLockAcquire() and SpinLockRelease() in 
ProcKill(), before step 3 can happen. Comment in spin.h about 
SpinLockAcquire/Release:

>  *    Load and store operations in calling code are guaranteed not to be
>  *    reordered with respect to these operations, because they include a
>  *    compiler barrier.  (Before PostgreSQL 9.5, callers needed to use a
>  *    volatile qualifier to access data protected by spinlocks.)

That talks about a compiler barrier, though, not a memory barrier. But
looking at the implementations in s_lock.h, I believe they do act as
memory barriers, too.

So you might indeed have that problem on 9.4, but AFAICS not on later 
versions.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: "ERROR: latch already owned" on gharial

From: Andres Freund

Hi,

On 2024-02-08 14:57:47 +0200, Heikki Linnakangas wrote:
> On 08/02/2024 04:08, Soumyadeep Chakraborty wrote:
> > A possible ordering of events:
> > 
> > (1) DisownLatch() is called by pid Y during ProcKill() and the write for
> > latch->owner_pid = 0 is NOT yet flushed to shmem.
> > 
> > (2) The PGPROC object for pid Y is returned to the free list.
> > 
> > (3) Pid X sees the same PGPROC object on the free list and grabs it.
> > 
> > (4) Pid X does sanity check inside OwnLatch during InitProcess and
> > still sees the
> > old value of latch->owner_pid = Y (and not = 0), and trips the ERROR.
> > 
> > The above sequence of operations should apply to PG HEAD as well.
> > 
> > Suggestion:
> > 
> > Should we do a pg_memory_barrier() at the end of DisownLatch(), like in
> > ResetLatch(), like the one introduced in [3]? This would ensure that the write
> > latch->owner_pid = 0; is flushed to shmem. The attached patch does this.
> 
> Hmm, there is a pair of SpinLockAcquire() and SpinLockRelease() in
> ProcKill(), before step 3 can happen.

Right.  I wonder if the issue instead could be something similar to what was
fixed in 8fb13dd6ab5b and more generally in 97550c0711972a. If two procs go
through proc_exit() for the same process, you can get all kinds of weird
mixed up resource ownership.  The bug fixed in 8fb13dd6ab5b wouldn't apply,
but it's pretty easy to introduce similar bugs in other places, so it seems
quite plausible that greenplum might have done so.  We also did have more
proc_exit()s in signal handlers in older branches, so it might just be an
issue that also was present before.

Greetings,

Andres Freund



Re: "ERROR: latch already owned" on gharial

From: Soumyadeep Chakraborty

Hey,

Deeply appreciate both your input!

On Thu, Feb 8, 2024 at 4:57 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm, there is a pair of SpinLockAcquire() and SpinLockRelease() in
> ProcKill(), before step 3 can happen. Comment in spin.h about
> SpinLockAcquire/Release:
>
> >  *    Load and store operations in calling code are guaranteed not to be
> >  *    reordered with respect to these operations, because they include a
> >  *    compiler barrier.  (Before PostgreSQL 9.5, callers needed to use a
> >  *    volatile qualifier to access data protected by spinlocks.)
>
> That talks about a compiler barrier, though, not a memory barrier. But
> looking at the implementations in s_lock.h, I believe they do act as
> memory barrier, too.
>
> So you might indeed have that problem on 9.4, but AFAICS not on later
> versions.

Yes, 9.4 does not have 0709b7ee72e, which I assume is what you are referring to.

Reading src/backend/storage/lmgr/README.barrier: For x86, to avoid reordering
between a load and a store, we need something that prevents both CPU and
compiler reordering. pg_memory_barrier() fits the bill.

Here we can have pid X's read of latch->owner_pid=Y reordered to precede
pid Y's store of latch->owner_pid = 0. The compiler barrier in S_UNLOCK() will
prevent compiler reordering but not CPU reordering of the above.

#define S_UNLOCK(lock) \
do { __asm__ __volatile__("" : : : "memory");  *(lock) = 0; } while (0)

whose barrier is equivalent to:

#define pg_compiler_barrier_impl() __asm__ __volatile__("" ::: "memory")

But maybe both compiler and CPU reordering will be prevented by the tas() in
S_LOCK(), which does a lock; xchgb?

Is the above acting as BOTH a compiler and CPU barrier, like the lock; addl
in pg_memory_barrier_impl()?
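
For reference, the x86-64 pieces being compared look roughly like this
(paraphrased from s_lock.h and port/atomics/arch-x86.h; the exact code varies
across versions):

/* s_lock.h (x86-64): what S_LOCK() ends up calling via TAS();
 * lock; xchgb is a full barrier on x86 */
static __inline__ int
tas(volatile slock_t *lock)
{
	register slock_t _res = 1;

	__asm__ __volatile__(
		"	lock			\n"
		"	xchgb	%0,%1	\n"
		: "+q"(_res), "+m"(*lock)
		: /* no inputs */
		: "memory", "cc");
	return (int) _res;
}

/* port/atomics/arch-x86.h (x86-64): full barrier used by pg_memory_barrier() */
#define pg_memory_barrier_impl()	\
	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory")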

If yes, then the picture would look like this:

Pid Y in DisownLatch(), Pid X in OwnLatch()

Y: LOAD latch->ownerPid
...
Y: STORE latch->ownerPid = 0
...
// returning PGPROC to freeList
Y: S_LOCK(ProcStructLock)   <--- tas() prevents X: LOAD latch->ownerPid from preceding this
...
...                         <--- X: LOAD latch->ownerPid can't get here anyway as the spinlock is held
...
Y:S_UNLOCK(ProcStructLock)
...
X: S_LOCK(ProcStructLock) // to retrieve PGPROC from freeList
...
X: S_UNLOCK(ProcStructLock)
...
X: LOAD latch->ownerPid

And this issue is not caused by 9.4 missing 0709b7ee72e, which changed
S_UNLOCK exclusively.

If no, then we would need the patch that does an explicit pg_memory_barrier()
at the end of DisownLatch() for PG HEAD.

On Thu, Feb 8, 2024 at 1:41 PM Andres Freund <andres@anarazel.de> wrote:

> Right.  I wonder if the issue istead could be something similar to what was
> fixed in 8fb13dd6ab5b and more generally in 97550c0711972a. If two procs go
> through proc_exit() for the same process, you can get all kinds of weird
> mixed up resource ownership.  The bug fixed in 8fb13dd6ab5b wouldn't apply,
> but it's pretty easy to introduce similar bugs in other places, so it seems
> quite plausible that greenplum might have done so.  We also did have more
> proc_exit()s in signal handlers in older branches, so it might just be an
> issue that also was present before.

Hmm, the PIDs X and Y in the example provided upthread don't spawn off any
children (e.g. by calling system()) - they are just regular backends. So it's
not possible for them to receive TERM and try to proc_exit() with the same
PGPROC. So that is not the issue, I guess?

Regards,
Soumyadeep (VMware)