Thread: Remove pthread_is_threaded_np() checks in postmaster
These checks are not effective for what they are trying to prevent. A recent commit[0] in libcurl when used on macOS has been tripping the pthread_is_threaded_np() check in postmaster.c for shared_preload_libraries entries which use libcurl (like Neon). Under the hood, libcurl calls SCDynamicStoreCopyProxies[1], which apparently causes the check to fail. Attached is a patch to remove the check, and a minimal reproducer (curlexe.zip) for others to run on macOS. Here are some logs: Postgres working as expected: > $ LC_ALL="C" /opt/homebrew/opt/postgresql@16/bin/postgres -D /opt/homebrew/var/postgresql@16 > 2024-01-22 23:18:51.203 GMT [50388] LOG: starting PostgreSQL 16.1 (Homebrew) on aarch64-apple-darwin23.2.0, compiled byApple clang version 15.0.0 (clang-1500.1.0.2.5), 64-bit > 2024-01-22 23:18:51.204 GMT [50388] LOG: listening on IPv6 address "::1", port 5432 > 2024-01-22 23:18:51.204 GMT [50388] LOG: listening on IPv4 address "127.0.0.1", port 5432 > 2024-01-22 23:18:51.205 GMT [50388] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" > 2024-01-22 23:18:51.207 GMT [50391] LOG: database system was shut down at 2023-12-21 23:12:10 GMT > 2024-01-22 23:18:51.211 GMT [50388] LOG: database system is ready to accept connections > ^C2024-01-22 23:18:53.797 GMT [50388] LOG: received fast shutdown request > 2024-01-22 23:18:53.798 GMT [50388] LOG: aborting any active transactions > 2024-01-22 23:18:53.800 GMT [50388] LOG: background worker "logical replication launcher" (PID 50394) exited with exitcode 1 > 2024-01-22 23:18:53.801 GMT [50389] LOG: shutting down > 2024-01-22 23:18:53.801 GMT [50389] LOG: checkpoint starting: shutdown immediate > 2024-01-22 23:18:53.805 GMT [50389] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed,0 recycled; write=0.002 s, sync=0.001 s, total=0.005 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0kB, estimate=0 kB; lsn=0/4BE77E0, redo lsn=0/4BE77E0 > 2024-01-22 23:18:53.809 GMT [50388] LOG: database system is shut down Postgres not working with attached extension preloaded: > $ echo shared_preload_libraries=curlexe >> /opt/homebrew/var/postgresql@16/postgresql.conf > $ LC_ALL="C" /opt/homebrew/opt/postgresql@16/bin/postgres -D /opt/homebrew/var/postgresql@16 > 2024-01-22 23:19:01.108 GMT [50395] LOG: starting PostgreSQL 16.1 (Homebrew) on aarch64-apple-darwin23.2.0, compiled byApple clang version 15.0.0 (clang-1500.1.0.2.5), 64-bit > 2024-01-22 23:19:01.110 GMT [50395] LOG: listening on IPv6 address "::1", port 5432 > 2024-01-22 23:19:01.110 GMT [50395] LOG: listening on IPv4 address "127.0.0.1", port 5432 > 2024-01-22 23:19:01.111 GMT [50395] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" > 2024-01-22 23:19:01.113 GMT [50395] FATAL: postmaster became multithreaded during startup > 2024-01-22 23:19:01.113 GMT [50395] HINT: Set the LC_ALL environment variable to a valid locale. > 2024-01-22 23:19:01.114 GMT [50395] LOG: database system is shut down Same as previous, but without IPv6 support in libcurl: > $ DYLD_LIBRARY_PATH=/opt/homebrew/opt/curl-without-ipv6/lib LC_ALL="C" /opt/homebrew/opt/postgresql@16/bin/postgres -D/opt/homebrew/var/postgresql@16 > 2024-01-22 23:23:17.151 GMT [50546] LOG: starting PostgreSQL 16.1 (Homebrew) on aarch64-apple-darwin23.2.0, compiled byApple clang version 15.0.0 (clang-1500.1.0.2.5), 64-bit > 2024-01-22 23:23:17.152 GMT [50546] LOG: listening on IPv6 address "::1", port 5432 > 2024-01-22 23:23:17.152 GMT [50546] LOG: listening on IPv4 address "127.0.0.1", port 5432 > 2024-01-22 23:23:17.152 GMT [50546] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432" > 2024-01-22 23:23:17.155 GMT [50549] LOG: database system was shut down at 2024-01-22 23:23:10 GMT > 2024-01-22 23:23:17.158 GMT [50546] LOG: database system is ready to accept connections > ^C2024-01-22 23:23:26.997 GMT [50546] LOG: received fast shutdown request > 2024-01-22 23:23:26.998 GMT [50546] LOG: aborting any active transactions > 2024-01-22 23:23:27.000 GMT [50546] LOG: background worker "logical replication launcher" (PID 50552) exited with exitcode 1 > 2024-01-22 23:23:27.000 GMT [50547] LOG: shutting down > 2024-01-22 23:23:27.001 GMT [50547] LOG: checkpoint starting: shutdown immediate > 2024-01-22 23:23:27.002 GMT [50547] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed,0 recycled; write=0.001 s, sync=0.001 s, total=0.003 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0kB, estimate=0 kB; lsn=0/4BE78D0, redo lsn=0/4BE78D0 > 2024-01-22 23:23:27.005 GMT [50546] LOG: database system is shut down [0]: https://github.com/curl/curl/commit/8b7cbe9decc205b08ec8258eb184c89a33e3084b [1]: https://developer.apple.com/documentation/systemconfiguration/1517088-scdynamicstorecopyproxies -- Tristan Partin Neon (https://neon.tech)
Attachment
Hi, On 2024-01-23 13:20:15 -0600, Tristan Partin wrote: > These checks are not effective for what they are trying to prevent. A recent > commit[0] in libcurl when used on macOS has been tripping the > pthread_is_threaded_np() check in postmaster.c for shared_preload_libraries > entries which use libcurl (like Neon). Under the hood, libcurl calls > SCDynamicStoreCopyProxies[1], which apparently causes the check to fail. Maybe I'm missing something, but isn't that indicating the exact opposite, namely that the check is precisely doing what it's intended? Greetings, Andres Freund
On Tue Jan 23, 2024 at 1:47 PM CST, Andres Freund wrote: > Hi, > On 2024-01-23 13:20:15 -0600, Tristan Partin wrote: > > These checks are not effective for what they are trying to prevent. A recent > > commit[0] in libcurl when used on macOS has been tripping the > > pthread_is_threaded_np() check in postmaster.c for shared_preload_libraries > > entries which use libcurl (like Neon). Under the hood, libcurl calls > > SCDynamicStoreCopyProxies[1], which apparently causes the check to fail. > > Maybe I'm missing something, but isn't that indicating the exact opposite, > namely that the check is precisely doing what it's intended? The logic in the original commit seems sound: > On Darwin, detect and report a multithreaded postmaster. > > Darwin --enable-nls builds use a substitute setlocale() that may start a > thread. Buildfarm member orangutan experienced BackendList corruption > on account of different postmaster threads executing signal handlers > simultaneously. Furthermore, a multithreaded postmaster risks undefined > behavior from sigprocmask() and fork(). Emit LOG messages about the > problem and its workaround. Back-patch to 9.0 (all supported versions). Some reading from signal(7): > A process-directed signal may be delivered to any one of the threads > that does not currently have the signal blocked. If more than one of > the threads has the signal unblocked, then the kernel chooses an > arbitrary thread to which to deliver the signal. So it sounds like the commit is trying to protect from the last sentence. And then forks only copy from their parent thread... Please ignore this thread. I need to think of a better solution to this problem. -- Tristan Partin Neon (https://neon.tech)
On Tue Jan 23, 2024 at 2:10 PM CST, Tristan Partin wrote: > On Tue Jan 23, 2024 at 1:47 PM CST, Andres Freund wrote: > > Hi, > > On 2024-01-23 13:20:15 -0600, Tristan Partin wrote: > > > These checks are not effective for what they are trying to prevent. A recent > > > commit[0] in libcurl when used on macOS has been tripping the > > > pthread_is_threaded_np() check in postmaster.c for shared_preload_libraries > > > entries which use libcurl (like Neon). Under the hood, libcurl calls > > > SCDynamicStoreCopyProxies[1], which apparently causes the check to fail. > > > > Maybe I'm missing something, but isn't that indicating the exact opposite, > > namely that the check is precisely doing what it's intended? > > The logic in the original commit seems sound: > > > On Darwin, detect and report a multithreaded postmaster. > > > > Darwin --enable-nls builds use a substitute setlocale() that may start a > > thread. Buildfarm member orangutan experienced BackendList corruption > > on account of different postmaster threads executing signal handlers > > simultaneously. Furthermore, a multithreaded postmaster risks undefined > > behavior from sigprocmask() and fork(). Emit LOG messages about the > > problem and its workaround. Back-patch to 9.0 (all supported versions). > > Some reading from signal(7): > > > A process-directed signal may be delivered to any one of the threads > > that does not currently have the signal blocked. If more than one of > > the threads has the signal unblocked, then the kernel chooses an > > arbitrary thread to which to deliver the signal. > > So it sounds like the commit is trying to protect from the last > sentence. > > And then forks only copy from their parent thread... What is keeping us from using pthread_sigmask(3) instead of sigprocmask(2)? If an extension can guarantee that threads that get launched by it don't interact with anything Postgres-related, would that be enough to protect from any fork(2) related issues? In the OP example, is there any harm in the thread that libcurl inadvertantly launches from a Postgres perspective? -- Tristan Partin Neon (https://neon.tech)
Hi, On 2024-01-23 15:50:11 -0600, Tristan Partin wrote: > What is keeping us from using pthread_sigmask(3) instead of sigprocmask(2)? We would need to make sure to compile with threading support everywhere. One issue is that on some platforms things get slower once you actually start using pthreads. > If an extension can guarantee that threads that get launched by it don't > interact with anything Postgres-related, would that be enough to protect > from any fork(2) related issues? A fork() while threads are running is undefined behavior IIRC, and undefined behavior isn't limited to a single thread. You'd definitely need to use pthread_sigprocmask etc to address that aspect alone. Greetings, Andres Freund
On Tue Jan 23, 2024 at 4:23 PM CST, Andres Freund wrote: > Hi, > > On 2024-01-23 15:50:11 -0600, Tristan Partin wrote: > > What is keeping us from using pthread_sigmask(3) instead of sigprocmask(2)? > > We would need to make sure to compile with threading support everywhere. One > issue is that on some platforms things get slower once you actually start > using pthreads. What are examples of these reduced performance platforms? From reading the meson.build files, it seems like building with threading enabled is the future, so should we just rip the band-aid off for 17? > > If an extension can guarantee that threads that get launched by it don't > > interact with anything Postgres-related, would that be enough to protect > > from any fork(2) related issues? > > A fork() while threads are running is undefined behavior IIRC, and undefined > behavior isn't limited to a single thread. You'd definitely need to use > pthread_sigprocmask etc to address that aspect alone. If you can find a resource that explains the UB, I would be very interested to read that. I found a SO[0] answer that made it seem like this actually wasn't the case. [0]: https://stackoverflow.com/a/42679479/7572728 -- Tristan Partin Neon (https://neon.tech)
Hi, On 2024-01-23 17:26:19 -0600, Tristan Partin wrote: > On Tue Jan 23, 2024 at 4:23 PM CST, Andres Freund wrote: > > Hi, > > > > On 2024-01-23 15:50:11 -0600, Tristan Partin wrote: > > > What is keeping us from using pthread_sigmask(3) instead of sigprocmask(2)? > > > > We would need to make sure to compile with threading support everywhere. One > > issue is that on some platforms things get slower once you actually start > > using pthreads. > > What are examples of these reduced performance platforms? With some libc, including IIRC glibc, FILE* style io starts to use locking, for example. Which we likely shouldn't use as heavily as we do, but we currently do use it for things like COPY. > From reading the meson.build files, it seems like building with threading > enabled is the future, so should we just rip the band-aid off for 17? Building with -pthreads and using threads are separate things... > > > If an extension can guarantee that threads that get launched by it don't > > > interact with anything Postgres-related, would that be enough to protect > > > from any fork(2) related issues? > > > > A fork() while threads are running is undefined behavior IIRC, and undefined > > behavior isn't limited to a single thread. You'd definitely need to use > > pthread_sigprocmask etc to address that aspect alone. > > If you can find a resource that explains the UB, I would be very interested > to read that. I found a SO[0] answer that made it seem like this actually > wasn't the case. I think there are safe ways to do it, but I don't think we currently reliably do so. It certainly wouldn't be well defined to have a thread created in postmaster, before backends are forked off ("the child process may only execute async-signal-safe operations until such time as one of the exec functions is called"). Greetings, Andres Freund
On Wed, Jan 24, 2024 at 1:39 PM Andres Freund <andres@anarazel.de> wrote: > On 2024-01-23 17:26:19 -0600, Tristan Partin wrote: > > On Tue Jan 23, 2024 at 4:23 PM CST, Andres Freund wrote: > > > A fork() while threads are running is undefined behavior IIRC, and undefined > > > behavior isn't limited to a single thread. You'd definitely need to use > > > pthread_sigprocmask etc to address that aspect alone. > > > > If you can find a resource that explains the UB, I would be very interested > > to read that. I found a SO[0] answer that made it seem like this actually > > wasn't the case. > > I think there are safe ways to do it, but I don't think we currently reliably > do so. It certainly wouldn't be well defined to have a thread created in > postmaster, before backends are forked off ("the child process may only > execute async-signal-safe operations until such time as one of the exec > functions is called"). Right, the classic example is that if you fork() while some other thread is in malloc() or fwrite() or whatever libc or other unknown code it might hold a mutex that will never be released in the child. As for what exactly might be happening in this case, I tried calling SCDynamicStoreCopyProxies() and saw a new thread sitting in __workq_kernreturn, which smells like libdispatch AKA GCD, Apple's thread pool job dispatch thing. I tried to step my way through and follow along on Apple's github and saw plenty of uses of libdispatch in CoreFoundation code, but not the culprit, and then I hit libxpc, which is closed source so I lost interest. Boo. Then I found this article that says some interesting stuff about all that: https://www.evanjones.ca/fork-is-dangerous.html That code has changed a bit since then but still tries to detect unsafe forks. https://github.com/apple-oss-distributions/libdispatch/blob/main/src/init.c These days, I don't think the original corruption complaint that led to that am-I-multihreaded check being added to PostgreSQL could happen anyway, because the postmaster would now process its state machine serially in the main thread's work loop even if a random unexpected thread happened to run the handler. But obviously that doesn't help us with these other complications so that observation isn't very interesting. As for sigprocmask() vs pthread_sigmask(), the sources of unspecifiedness I am aware of are: (1) Unix vendors disagreeing on whether the former affected only the calling thread or the whole process before threads were standardised, and we can see that they still differ today (eg Darwin/XNU loops over all threads setting them, while many other systems do exactly the same as pthread_sigmask()), and (2) libcs sometimes use or defer some signals for their own private purposes, so sometimes pthread_sigmask() has a wrapper doing some footwork in userspace rather than just invoking the system call, but I dunno. In one of the threads about avoiding bad behaviour around system(), I think there might have been some ideas about getting rid of the need to block signals at all, which I think must be theoretically possible if the handlers are smart enough to avoid misbehaving in child processes, and maybe we use moar latches.