Re: "ERROR: latch already owned" on gharial - Mailing list pgsql-hackers

From Soumyadeep Chakraborty
Subject Re: "ERROR: latch already owned" on gharial
Date
Msg-id CAE-ML+_CL3TfhLo6MjCSufinyugqSJWr8qEoWL8oAc-oT+P67g@mail.gmail.com
In response to Re: "ERROR: latch already owned" on gharial  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: "ERROR: latch already owned" on gharial  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
Hey hackers,

I wanted to report that we have seen this issue (with the procLatch) a few
times, very sporadically, on Greenplum 6X (based on 9.4), with relatively new
versions of GCC.

I realize that 9.4 is out of support, so this email is purely to add on to the
existing thread, in case the info can help fix/reveal something in supported
versions.

Unfortunately, we don't have a core to share as we don't have the benefit of
commit [1] in Greenplum 6X, but we do have commit [2], which gives us an elog
ERROR rather than a PANIC.

Instance 1:

Event 1: 2023-11-13 10:01:31.927168 CET..., pY,
..."LOG","00000","disconnection: session time: ..."
Event 2: 2023-11-13 10:01:32.049135 CET...,pX,,,,,"FATAL","XX000","latch already owned by pid Y (is_set: 0) (pg_latch.c:159)",,,,,,,0,,"pg_latch.c",159,"Stack trace:
1    0xbde8b8 postgres errstart (elog.c:567)
2    0xbe0768 postgres elog_finish (discriminator 7)
3    0xa08924 postgres <symbol not found> (pg_latch.c:158) <---------- OwnLatch
4    0xa7f179 postgres InitProcess (proc.c:523)
5    0xa94ac3 postgres PostgresMain (postgres.c:4874)
6    0xa1e2ed postgres <symbol not found> (postmaster.c:2860)
7    0xa1f295 postgres PostmasterMain (discriminator 5)
...
"LOG","00000","server process (PID Y) exited with exit code 1",,,,,,,0,,"postmaster.c",3987,

Instance 2 (was reported with (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)):

Exactly the same as Instance 1, with identical log, ordering of events and stack
trace, except that this time is_set is 1 when the ERROR is logged.

A possible ordering of events:

(1) DisownLatch() is called by pid Y during ProcKill() and the write for
latch->owner_pid = 0 is NOT yet flushed to shmem.

(2) The PGPROC object for pid Y is returned to the free list.

(3) Pid X sees the same PGPROC object on the free list and grabs it.

(4) Pid X does sanity check inside OwnLatch during InitProcess and still sees
the old value of latch->owner_pid = Y (and not = 0), and trips the ERROR.

The above sequence of operations should apply to PG HEAD as well.
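
For illustration only, the ordering above can be modelled as a small standalone
C sketch. This is not PostgreSQL code: ModelLatch, model_own_latch(),
model_reuse_slot() and the pid values are hypothetical names made up for the
example, and the "write not yet visible" condition from step (1) is modelled
with a plain boolean rather than by real cross-process memory ordering.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pids standing in for "Y" (exiting) and "X" (starting). */
enum { PID_Y = 1001, PID_X = 1002 };

/* Hypothetical, simplified stand-in for the latch embedded in a PGPROC. */
typedef struct ModelLatch
{
    int  is_set;
    bool is_shared;
    int  owner_pid;   /* 0 means "not owned" */
} ModelLatch;

/*
 * Simplified version of the sanity check in OwnLatch(): returning false
 * corresponds to the "latch already owned" report.
 */
static bool
model_own_latch(ModelLatch *latch, int my_pid)
{
    assert(latch->is_shared);
    if (latch->owner_pid != 0)
        return false;
    latch->owner_pid = my_pid;
    return true;
}

/*
 * Walk through steps (1)-(4). The parameter says whether the store of
 * owner_pid = 0 made by pid Y in step (1) is visible to pid X in step (4).
 */
static bool
model_reuse_slot(bool disown_visible_to_x)
{
    ModelLatch latch = { .is_set = 0, .is_shared = true, .owner_pid = PID_Y };

    /* (1) Y disowns the latch; X may or may not observe that write. */
    latch.owner_pid = disown_visible_to_x ? 0 : PID_Y;

    /* (2) + (3) The slot goes to the free list and X grabs it (not modelled). */

    /* (4) X runs the sanity check against the value it observes. */
    return model_own_latch(&latch, PID_X);
}
```

In this model, model_reuse_slot(true) succeeds, while model_reuse_slot(false)
fails the check, which is the hypothesised path to the logged error.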

Suggestion:

Should we add a pg_memory_barrier() at the end of DisownLatch(), like the one
added to ResetLatch() in [3]? This would ensure that the write
latch->owner_pid = 0; is flushed to shmem. The attached patch does this.
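
As a rough standalone sketch of the idea (not the attached patch, which is not
reproduced here), the following uses C11 atomics, with
atomic_thread_fence(memory_order_seq_cst) as an assumed stand-in for
pg_memory_barrier(). SketchLatch and the sketch_* functions are hypothetical
names; the real DisownLatch() operates on PostgreSQL's Latch struct and
MyProcPid.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for PostgreSQL's Latch struct. */
typedef struct SketchLatch
{
    atomic_int is_set;
    bool       is_shared;
    atomic_int owner_pid;   /* 0 means "not owned" */
} SketchLatch;

/* Initialise a shared, unowned latch. */
static void
sketch_init_latch(SketchLatch *latch)
{
    atomic_init(&latch->is_set, 0);
    latch->is_shared = true;
    atomic_init(&latch->owner_pid, 0);
}

/* Take ownership; false corresponds to "latch already owned". */
static bool
sketch_own_latch(SketchLatch *latch, int my_pid)
{
    assert(latch->is_shared);
    if (atomic_load_explicit(&latch->owner_pid, memory_order_relaxed) != 0)
        return false;
    atomic_store_explicit(&latch->owner_pid, my_pid, memory_order_relaxed);
    return true;
}

/* Give up ownership, followed by the proposed full barrier. */
static void
sketch_disown_latch(SketchLatch *latch, int my_pid)
{
    assert(latch->is_shared);
    assert(atomic_load_explicit(&latch->owner_pid,
                                memory_order_relaxed) == my_pid);

    atomic_store_explicit(&latch->owner_pid, 0, memory_order_relaxed);

    /*
     * Assumed equivalent of pg_memory_barrier(): a full fence, so that the
     * store above is ordered before any later loads or stores issued by
     * this process.
     */
    atomic_thread_fence(memory_order_seq_cst);
}
```

A single-process run can only check the bookkeeping (own, disown, own again
under a different pid); it cannot demonstrate the cross-process ordering that
the barrier is meant to provide.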

I'm not sure why we didn't introduce a memory barrier in DisownLatch() in [3].
I didn't find anything in the associated hackers thread [4] either. Was it the
performance impact, or was it just because SetLatch and ResetLatch were more
racy and this is way less likely to happen?

This is out of my wheelhouse, but would one additional barrier in a process'
lifecycle be that bad for performance?

Appendix:

Build details: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)

CFLAGS=-Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -fno-aggressive-loop-optimizations
-Wno-unused-but-set-variable -Wno-address -Werror=implicit-fallthrough=3
-Wno-format-truncation -Wno-stringop-truncation -m64 -O3
-fargument-noalias-global -fno-omit-frame-pointer -g -std=gnu99
-Werror=uninitialized -Werror=implicit-function-declaration

Regards,
Soumyadeep (VMware)

[1] https://github.com/postgres/postgres/commit/12e28aac8e8eb76cab13a4e9b696e3dab17f1c99
[2] https://github.com/greenplum-db/gpdb/commit/81fdd6c5219af865e9dc41f4087e0405d6616050
[3] https://github.com/postgres/postgres/commit/14e8803f101a54d99600683543b0f893a2e3f529
[4] https://www.postgresql.org/message-id/flat/20150112154026.GB2092%40awork2.anarazel.de
