Thread: Remove pthread_is_threaded_np() checks in postmaster

Remove pthread_is_threaded_np() checks in postmaster

From
"Tristan Partin"
Date:
These checks are not effective for what they are trying to prevent.
A recent commit[0] in libcurl when used on macOS has been tripping the
pthread_is_threaded_np() check in postmaster.c for
shared_preload_libraries entries which use libcurl (like Neon). Under
the hood, libcurl calls SCDynamicStoreCopyProxies[1], which apparently
causes the check to fail.

Attached is a patch to remove the check, and a minimal reproducer
(curlexe.zip) for others to run on macOS.

Here are some logs:

Postgres working as expected:
> $ LC_ALL="C" /opt/homebrew/opt/postgresql@16/bin/postgres -D /opt/homebrew/var/postgresql@16
> 2024-01-22 23:18:51.203 GMT [50388] LOG:  starting PostgreSQL 16.1 (Homebrew) on aarch64-apple-darwin23.2.0, compiled
byApple clang version 15.0.0 (clang-1500.1.0.2.5), 64-bit 
> 2024-01-22 23:18:51.204 GMT [50388] LOG:  listening on IPv6 address "::1", port 5432
> 2024-01-22 23:18:51.204 GMT [50388] LOG:  listening on IPv4 address "127.0.0.1", port 5432
> 2024-01-22 23:18:51.205 GMT [50388] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
> 2024-01-22 23:18:51.207 GMT [50391] LOG:  database system was shut down at 2023-12-21 23:12:10 GMT
> 2024-01-22 23:18:51.211 GMT [50388] LOG:  database system is ready to accept connections
> ^C2024-01-22 23:18:53.797 GMT [50388] LOG:  received fast shutdown request
> 2024-01-22 23:18:53.798 GMT [50388] LOG:  aborting any active transactions
> 2024-01-22 23:18:53.800 GMT [50388] LOG:  background worker "logical replication launcher" (PID 50394) exited with
exitcode 1 
> 2024-01-22 23:18:53.801 GMT [50389] LOG:  shutting down
> 2024-01-22 23:18:53.801 GMT [50389] LOG:  checkpoint starting: shutdown immediate
> 2024-01-22 23:18:53.805 GMT [50389] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0
removed,0 recycled; write=0.002 s, sync=0.001 s, total=0.005 s; sync files=2, longest=0.001 s, average=0.001 s;
distance=0kB, estimate=0 kB; lsn=0/4BE77E0, redo lsn=0/4BE77E0 
> 2024-01-22 23:18:53.809 GMT [50388] LOG:  database system is shut down

Postgres not working with attached extension preloaded:
> $ echo shared_preload_libraries=curlexe >> /opt/homebrew/var/postgresql@16/postgresql.conf
> $ LC_ALL="C" /opt/homebrew/opt/postgresql@16/bin/postgres -D /opt/homebrew/var/postgresql@16
> 2024-01-22 23:19:01.108 GMT [50395] LOG:  starting PostgreSQL 16.1 (Homebrew) on aarch64-apple-darwin23.2.0, compiled
byApple clang version 15.0.0 (clang-1500.1.0.2.5), 64-bit 
> 2024-01-22 23:19:01.110 GMT [50395] LOG:  listening on IPv6 address "::1", port 5432
> 2024-01-22 23:19:01.110 GMT [50395] LOG:  listening on IPv4 address "127.0.0.1", port 5432
> 2024-01-22 23:19:01.111 GMT [50395] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
> 2024-01-22 23:19:01.113 GMT [50395] FATAL:  postmaster became multithreaded during startup
> 2024-01-22 23:19:01.113 GMT [50395] HINT:  Set the LC_ALL environment variable to a valid locale.
> 2024-01-22 23:19:01.114 GMT [50395] LOG:  database system is shut down

Same as previous, but without IPv6 support in libcurl:
> $ DYLD_LIBRARY_PATH=/opt/homebrew/opt/curl-without-ipv6/lib LC_ALL="C" /opt/homebrew/opt/postgresql@16/bin/postgres
-D/opt/homebrew/var/postgresql@16 
> 2024-01-22 23:23:17.151 GMT [50546] LOG:  starting PostgreSQL 16.1 (Homebrew) on aarch64-apple-darwin23.2.0, compiled
byApple clang version 15.0.0 (clang-1500.1.0.2.5), 64-bit 
> 2024-01-22 23:23:17.152 GMT [50546] LOG:  listening on IPv6 address "::1", port 5432
> 2024-01-22 23:23:17.152 GMT [50546] LOG:  listening on IPv4 address "127.0.0.1", port 5432
> 2024-01-22 23:23:17.152 GMT [50546] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
> 2024-01-22 23:23:17.155 GMT [50549] LOG:  database system was shut down at 2024-01-22 23:23:10 GMT
> 2024-01-22 23:23:17.158 GMT [50546] LOG:  database system is ready to accept connections
> ^C2024-01-22 23:23:26.997 GMT [50546] LOG:  received fast shutdown request
> 2024-01-22 23:23:26.998 GMT [50546] LOG:  aborting any active transactions
> 2024-01-22 23:23:27.000 GMT [50546] LOG:  background worker "logical replication launcher" (PID 50552) exited with
exitcode 1 
> 2024-01-22 23:23:27.000 GMT [50547] LOG:  shutting down
> 2024-01-22 23:23:27.001 GMT [50547] LOG:  checkpoint starting: shutdown immediate
> 2024-01-22 23:23:27.002 GMT [50547] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0
removed,0 recycled; write=0.001 s, sync=0.001 s, total=0.003 s; sync files=2, longest=0.001 s, average=0.001 s;
distance=0kB, estimate=0 kB; lsn=0/4BE78D0, redo lsn=0/4BE78D0 
> 2024-01-22 23:23:27.005 GMT [50546] LOG:  database system is shut down

[0]: https://github.com/curl/curl/commit/8b7cbe9decc205b08ec8258eb184c89a33e3084b
[1]: https://developer.apple.com/documentation/systemconfiguration/1517088-scdynamicstorecopyproxies

--
Tristan Partin
Neon (https://neon.tech)

Attachment

Re: Remove pthread_is_threaded_np() checks in postmaster

From
Andres Freund
Date:
Hi,
On 2024-01-23 13:20:15 -0600, Tristan Partin wrote:
> These checks are not effective for what they are trying to prevent. A recent
> commit[0] in libcurl when used on macOS has been tripping the
> pthread_is_threaded_np() check in postmaster.c for shared_preload_libraries
> entries which use libcurl (like Neon). Under the hood, libcurl calls
> SCDynamicStoreCopyProxies[1], which apparently causes the check to fail.

Maybe I'm missing something, but isn't that indicating the exact opposite,
namely that the check is precisely doing what it's intended?

Greetings,

Andres Freund



Re: Remove pthread_is_threaded_np() checks in postmaster

From
"Tristan Partin"
Date:
On Tue Jan 23, 2024 at 1:47 PM CST, Andres Freund wrote:
> Hi,
> On 2024-01-23 13:20:15 -0600, Tristan Partin wrote:
> > These checks are not effective for what they are trying to prevent. A recent
> > commit[0] in libcurl when used on macOS has been tripping the
> > pthread_is_threaded_np() check in postmaster.c for shared_preload_libraries
> > entries which use libcurl (like Neon). Under the hood, libcurl calls
> > SCDynamicStoreCopyProxies[1], which apparently causes the check to fail.
>
> Maybe I'm missing something, but isn't that indicating the exact opposite,
> namely that the check is precisely doing what it's intended?

The logic in the original commit seems sound:

> On Darwin, detect and report a multithreaded postmaster.
>
> Darwin --enable-nls builds use a substitute setlocale() that may start a
> thread.  Buildfarm member orangutan experienced BackendList corruption
> on account of different postmaster threads executing signal handlers
> simultaneously.  Furthermore, a multithreaded postmaster risks undefined
> behavior from sigprocmask() and fork().  Emit LOG messages about the
> problem and its workaround.  Back-patch to 9.0 (all supported versions).

Some reading from signal(7):

> A process-directed signal may be delivered to any one of the threads
> that does not currently have the signal blocked. If more than one of
> the threads has the signal unblocked, then the kernel chooses an
> arbitrary thread to which to deliver the signal.

So it sounds like the commit is trying to protect from the last
sentence.

And then forks only copy from their parent thread...

Please ignore this thread. I need to think of a better solution to this
problem.

--
Tristan Partin
Neon (https://neon.tech)



Re: Remove pthread_is_threaded_np() checks in postmaster

From
"Tristan Partin"
Date:
On Tue Jan 23, 2024 at 2:10 PM CST, Tristan Partin wrote:
> On Tue Jan 23, 2024 at 1:47 PM CST, Andres Freund wrote:
> > Hi,
> > On 2024-01-23 13:20:15 -0600, Tristan Partin wrote:
> > > These checks are not effective for what they are trying to prevent. A recent
> > > commit[0] in libcurl when used on macOS has been tripping the
> > > pthread_is_threaded_np() check in postmaster.c for shared_preload_libraries
> > > entries which use libcurl (like Neon). Under the hood, libcurl calls
> > > SCDynamicStoreCopyProxies[1], which apparently causes the check to fail.
> >
> > Maybe I'm missing something, but isn't that indicating the exact opposite,
> > namely that the check is precisely doing what it's intended?
>
> The logic in the original commit seems sound:
>
> > On Darwin, detect and report a multithreaded postmaster.
> >
> > Darwin --enable-nls builds use a substitute setlocale() that may start a
> > thread.  Buildfarm member orangutan experienced BackendList corruption
> > on account of different postmaster threads executing signal handlers
> > simultaneously.  Furthermore, a multithreaded postmaster risks undefined
> > behavior from sigprocmask() and fork().  Emit LOG messages about the
> > problem and its workaround.  Back-patch to 9.0 (all supported versions).
>
> Some reading from signal(7):
>
> > A process-directed signal may be delivered to any one of the threads
> > that does not currently have the signal blocked. If more than one of
> > the threads has the signal unblocked, then the kernel chooses an
> > arbitrary thread to which to deliver the signal.
>
> So it sounds like the commit is trying to protect from the last
> sentence.
>
> And then forks only copy from their parent thread...

What is keeping us from using pthread_sigmask(3) instead of
sigprocmask(2)? If an extension can guarantee that threads that get
launched by it don't interact with anything Postgres-related, would that
be enough to protect from any fork(2) related issues? In the OP example,
is there any harm in the thread that libcurl inadvertantly launches from
a Postgres perspective?

--
Tristan Partin
Neon (https://neon.tech)



Re: Remove pthread_is_threaded_np() checks in postmaster

From
Andres Freund
Date:
Hi,

On 2024-01-23 15:50:11 -0600, Tristan Partin wrote:
> What is keeping us from using pthread_sigmask(3) instead of sigprocmask(2)?

We would need to make sure to compile with threading support everywhere. One
issue is that on some platforms things get slower once you actually start
using pthreads.


> If an extension can guarantee that threads that get launched by it don't
> interact with anything Postgres-related, would that be enough to protect
> from any fork(2) related issues?

A fork() while threads are running is undefined behavior IIRC, and undefined
behavior isn't limited to a single thread. You'd definitely need to use
pthread_sigprocmask etc to address that aspect alone.

Greetings,

Andres Freund



Re: Remove pthread_is_threaded_np() checks in postmaster

From
"Tristan Partin"
Date:
On Tue Jan 23, 2024 at 4:23 PM CST, Andres Freund wrote:
> Hi,
>
> On 2024-01-23 15:50:11 -0600, Tristan Partin wrote:
> > What is keeping us from using pthread_sigmask(3) instead of sigprocmask(2)?
>
> We would need to make sure to compile with threading support everywhere. One
> issue is that on some platforms things get slower once you actually start
> using pthreads.

What are examples of these reduced performance platforms?

From reading the meson.build files, it seems like building with
threading enabled is the future, so should we just rip the band-aid off
for 17?

> > If an extension can guarantee that threads that get launched by it don't
> > interact with anything Postgres-related, would that be enough to protect
> > from any fork(2) related issues?
>
> A fork() while threads are running is undefined behavior IIRC, and undefined
> behavior isn't limited to a single thread. You'd definitely need to use
> pthread_sigprocmask etc to address that aspect alone.

If you can find a resource that explains the UB, I would be very
interested to read that. I found a SO[0] answer that made it seem like
this actually wasn't the case.

[0]: https://stackoverflow.com/a/42679479/7572728

--
Tristan Partin
Neon (https://neon.tech)



Re: Remove pthread_is_threaded_np() checks in postmaster

From
Andres Freund
Date:
Hi,

On 2024-01-23 17:26:19 -0600, Tristan Partin wrote:
> On Tue Jan 23, 2024 at 4:23 PM CST, Andres Freund wrote:
> > Hi,
> > 
> > On 2024-01-23 15:50:11 -0600, Tristan Partin wrote:
> > > What is keeping us from using pthread_sigmask(3) instead of sigprocmask(2)?
> > 
> > We would need to make sure to compile with threading support everywhere. One
> > issue is that on some platforms things get slower once you actually start
> > using pthreads.
> 
> What are examples of these reduced performance platforms?

With some libc, including IIRC glibc, FILE* style io starts to use locking,
for example. Which we likely shouldn't use as heavily as we do, but we
currently do use it for things like COPY.


> From reading the meson.build files, it seems like building with threading
> enabled is the future, so should we just rip the band-aid off for 17?

Building with -pthreads and using threads are separate things...


> > > If an extension can guarantee that threads that get launched by it don't
> > > interact with anything Postgres-related, would that be enough to protect
> > > from any fork(2) related issues?
> > 
> > A fork() while threads are running is undefined behavior IIRC, and undefined
> > behavior isn't limited to a single thread. You'd definitely need to use
> > pthread_sigprocmask etc to address that aspect alone.
> 
> If you can find a resource that explains the UB, I would be very interested
> to read that. I found a SO[0] answer that made it seem like this actually
> wasn't the case.

I think there are safe ways to do it, but I don't think we currently reliably
do so. It certainly wouldn't be well defined to have a thread created in
postmaster, before backends are forked off ("the child process may only
execute async-signal-safe operations until such time as one of the exec
functions is called").

Greetings,

Andres Freund



Re: Remove pthread_is_threaded_np() checks in postmaster

From
Thomas Munro
Date:
On Wed, Jan 24, 2024 at 1:39 PM Andres Freund <andres@anarazel.de> wrote:
> On 2024-01-23 17:26:19 -0600, Tristan Partin wrote:
> > On Tue Jan 23, 2024 at 4:23 PM CST, Andres Freund wrote:
> > > A fork() while threads are running is undefined behavior IIRC, and undefined
> > > behavior isn't limited to a single thread. You'd definitely need to use
> > > pthread_sigprocmask etc to address that aspect alone.
> >
> > If you can find a resource that explains the UB, I would be very interested
> > to read that. I found a SO[0] answer that made it seem like this actually
> > wasn't the case.
>
> I think there are safe ways to do it, but I don't think we currently reliably
> do so. It certainly wouldn't be well defined to have a thread created in
> postmaster, before backends are forked off ("the child process may only
> execute async-signal-safe operations until such time as one of the exec
> functions is called").

Right, the classic example is that if you fork() while some other
thread is in malloc() or fwrite() or whatever libc or other unknown
code it might hold a mutex that will never be released in the child.

As for what exactly might be happening in this case, I tried calling
SCDynamicStoreCopyProxies() and saw a new thread sitting in
__workq_kernreturn, which smells like libdispatch AKA GCD, Apple's
thread pool job dispatch thing.  I tried to step my way through and
follow along on Apple's github and saw plenty of uses of libdispatch
in CoreFoundation code, but not the culprit, and then I hit libxpc,
which is closed source so I lost interest.  Boo.  Then I found this
article that says some interesting stuff about all that:

https://www.evanjones.ca/fork-is-dangerous.html

That code has changed a bit since then but still tries to detect unsafe forks.

https://github.com/apple-oss-distributions/libdispatch/blob/main/src/init.c

These days, I don't think the original corruption complaint that led
to that am-I-multihreaded check being added to PostgreSQL could happen
anyway, because the postmaster would now process its state machine
serially in the main thread's work loop even if a random unexpected
thread happened to run the handler.  But obviously that doesn't help
us with these other complications so that observation isn't very
interesting.

As for sigprocmask() vs pthread_sigmask(), the sources of
unspecifiedness I am aware of are: (1) Unix vendors disagreeing on
whether the former affected only the calling thread or the whole
process before threads were standardised, and we can see that they
still differ today (eg Darwin/XNU loops over all threads setting them,
while many other systems do exactly the same as pthread_sigmask()),
and (2) libcs sometimes use or defer some signals for their own
private purposes, so sometimes pthread_sigmask() has a wrapper doing
some footwork in userspace rather than just invoking the system call,
but I dunno.  In one of the threads about avoiding bad behaviour
around system(), I think there might have been some ideas about
getting rid of the need to block signals at all, which I think must be
theoretically possible if the handlers are smart enough to avoid
misbehaving in child processes, and maybe we use moar latches.