Thread: pltcl crash on recent macOS
A little while ago, the pltcl tests started crashing for me on macOS. I don't know what had changed, but I suspect it was either an operating system update or something like an Xcode update.

Here is a backtrace:

* frame #0: 0x00007ff7b0e61853
  frame #1: 0x00007ff803a28751 libsystem_c.dylib`hash_search + 215
  frame #2: 0x0000000110357700 pltcl.so`compile_pltcl_function(fn_oid=16418, tgreloid=0, is_event_trigger=false, pltrusted=true) at pltcl.c:1418:13
  frame #3: 0x0000000110355d50 pltcl.so`pltcl_func_handler(fcinfo=0x00007fb6f1817028, call_state=0x00007ff7b0e61b80, pltrusted=true) at pltcl.c:814:12
  ...

Note that the hash_search call goes into some system library, not postgres.

The command to link pltcl is:

gcc ... -ltcl8.6 -lz -lpthread -framework CoreFoundation -lc -bundle_loader ../../../src/backend/postgres

Notice the -lc in there. If I remove that, it works again.

The -lc is explicitly added in src/pl/tcl/Makefile, so it's our own doing. I tracked this back, and it's been moved and rearranged in that makefile a number of times. The original addition was

commit e3909672f12e0ddf3e202b824fda068ad2195ef2
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon Dec 14 00:46:49 1998

    Build pltcl.so correctly on platforms that want dependent shared libraries to be listed in the link command.

Has anyone else seen this?

Note, I'm using the tcl-tk package from Homebrew. The tcl installation provided by macOS itself no longer appears to work for linking against.

On Mon, Jun 13, 2022 at 6:53 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
> frame #1: 0x00007ff803a28751 libsystem_c.dylib`hash_search + 215
> frame #2: 0x0000000110357700
> pltcl.so`compile_pltcl_function(fn_oid=16418, tgreloid=0,

Hmm, I can’t reproduce that…. although that symbol is present in my libSystem.B.dylib according to dlsym() and callable from a simple program not linked to anything else, pltcl.so is apparently reaching postgres’s hash_search for me, based on the fact that make -C src/pl/tcl check succeeds and nm -m on pltcl.so shows it as "from executable". It would be interesting to see what nm -m shows for you.

Archeological note: That hash_search stuff, header <strhash.h>, seems to have been copied from ancient FreeBSD before it was dropped upstream for the crime of polluting the global symbol namespace with junk[1]. It's been languishing in Apple's libc for at least 19 years[2], though, so I'm not sure why it's showing up suddenly as a problem for you now.

> Note, I'm using the tcl-tk package from Homebrew. The tcl installation
> provided by macOS itself no longer appears to work for linking against.

I’m using tcl 8.6.12 installed by MacPorts on macOS 12.4, though, hmm, SDK 12.3. I see the explicit -lc when building pltcl.so, and I see that libSystem.B.dylib is explicitly mentioned here, whether or not I have -lc:

% otool -L ./tmp_install/Users/tmunro/install/lib/postgresql/pltcl.so
./tmp_install/Users/tmunro/install/lib/postgresql/pltcl.so:
        /opt/local/lib/libtcl8.6.dylib (compatibility version 8.6.0, current version 8.6.12)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.100.3)

Here’s the complete link line:

ccache cc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -g -O0 -bundle -multiply_defined suppress -o pltcl.so pltcl.o -L../../../src/port -L../../../src/common -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk -Wl,-dead_strip_dylibs -L/opt/local/lib -ltcl8.6 -lz -lpthread -framework CoreFoundation -lc -bundle_loader ../../../src/backend/postgres

[1] https://github.com/freebsd/freebsd-src/commit/dc196afb2e58dd05cd66e2da44872bb3d619910f
[2] https://github.com/apple-open-source-mirror/Libc/blame/master/stdlib/FreeBSD/strhash.c

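The dlsym() observation in this message can be repeated without writing a C program. What follows is only a minimal sketch, assuming a POSIX system where Python's ctypes can open the running process's global symbol namespace via CDLL(None); the helper name has_global_symbol is hypothetical and not part of the thread. On macOS it should show whether libSystem exports hash_search at all. It says nothing about which definition pltcl.so actually binds to.

```python
import ctypes


def has_global_symbol(name: str) -> bool:
    """Return True if `name` resolves in this process's global namespace.

    CDLL(None) wraps dlopen(NULL), so the attribute lookup below is roughly
    a dlsym() against everything already loaded into the process.
    """
    process = ctypes.CDLL(None)
    try:
        getattr(process, name)
    except AttributeError:
        # ctypes raises AttributeError when dlsym() finds nothing.
        return False
    return True


if __name__ == "__main__":
    # hash_search / hash_create are the names that show up in the backtrace
    # and nm output in this thread; results depend on the platform's libc.
    for symbol in ("hash_search", "hash_create"):
        print(symbol, has_global_symbol(symbol))
```

A True result only confirms that a second, unrelated hash_search is visible to the dynamic linker, which is the precondition for the mis-binding discussed here.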
Thomas Munro <thomas.munro@gmail.com> writes:
> On Mon, Jun 13, 2022 at 6:53 PM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
>> frame #1: 0x00007ff803a28751 libsystem_c.dylib`hash_search + 215
>> frame #2: 0x0000000110357700
>> pltcl.so`compile_pltcl_function(fn_oid=16418, tgreloid=0,

> Hmm, I can’t reproduce that….

I can't either, although I'm using the macOS-provided Tcl code, which still works fine for me. (I grant that Apple might desupport that someday, but they haven't yet.) sifaka and longfin aren't unhappy either; although sifaka is close to identical to my laptop.

Having said that, I wonder whether the position of the -bundle_loader switch in the command line is relevant to which way the hash_search reference is resolved. Seems like we could put it in front of the various -l options if that'd help.

regards, tom lane

On 13.06.22 13:27, Thomas Munro wrote:
> On Mon, Jun 13, 2022 at 6:53 PM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
>> frame #1: 0x00007ff803a28751 libsystem_c.dylib`hash_search + 215
>> frame #2: 0x0000000110357700
>> pltcl.so`compile_pltcl_function(fn_oid=16418, tgreloid=0,
>
> Hmm, I can’t reproduce that…. although that symbol is present in my
> libSystem.B.dylib according to dlsym() and callable from a simple
> program not linked to anything else, pltcl.so is apparently reaching
> postgres’s hash_search for me, based on the fact that make -C
> src/pl/tcl check succeeds and nm -m on pltcl.so shows it as "from
> executable". It would be interesting to see what nm -m shows for you.

...
(undefined) external _get_call_result_type (from executable)
(undefined) external _getmissingattr (from executable)
(undefined) external _hash_create (from libSystem)
(undefined) external _hash_search (from libSystem)
...

> I’m using tcl 8.6.12 installed by MacPorts on macOS 12.4, though, hmm,
> SDK 12.3. I see the explicit -lc when building pltcl.so, and I see
> that libSystem.B.dylib is explicitly mentioned here, whether or not I
> have -lc:
>
> % otool -L ./tmp_install/Users/tmunro/install/lib/postgresql/pltcl.so
> ./tmp_install/Users/tmunro/install/lib/postgresql/pltcl.so:
> /opt/local/lib/libtcl8.6.dylib (compatibility version 8.6.0, current
> version 8.6.12)
> /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current
> version 1311.100.3)

Looks the same here:

pltcl.so:
        /usr/local/opt/tcl-tk/lib/libtcl8.6.dylib (compatibility version 8.6.0, current version 8.6.12)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.100.3)

> Here’s the complete link line:
>
> ccache cc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Werror=vla
> -Werror=unguarded-availability-new -Wendif-labels
> -Wmissing-format-attribute -Wcast-function-type -Wformat-security
> -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument
> -Wno-compound-token-split-by-macro -g -O0 -bundle -multiply_defined
> suppress -o pltcl.so pltcl.o -L../../../src/port
> -L../../../src/common -isysroot
> /Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk
> -Wl,-dead_strip_dylibs -L/opt/local/lib -ltcl8.6 -lz -lpthread
> -framework CoreFoundation -lc -bundle_loader
> ../../../src/backend/postgres

The difference is that I use CC=gcc-11. I have changed to CC=cc, then it works (nm output shows "from executable"). So it's gcc that gets thrown off by the -lc.

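The nm -m check in this message lends itself to a small script. The following is a sketch only, assuming the output format matches the lines quoted above, "(undefined) external _name (from origin)"; the function names symbols_by_origin and misbound, and the idea of passing in a set of symbols expected to come from the postgres executable, are illustrative assumptions rather than anything proposed in the thread.

```python
import re
from typing import Dict, List, Set

# Matches lines such as:
#   (undefined) external _hash_search (from libSystem)
_UNDEFINED_RE = re.compile(
    r"\(undefined\)\s+(?:weak\s+)?external\s+(\S+)\s+\(from\s+([^)]+)\)"
)


def symbols_by_origin(nm_output: str) -> Dict[str, List[str]]:
    """Group undefined symbols by the image that nm -m says they bind to."""
    origins: Dict[str, List[str]] = {}
    for line in nm_output.splitlines():
        match = _UNDEFINED_RE.search(line)
        if match:
            symbol, origin = match.groups()
            origins.setdefault(origin, []).append(symbol)
    return origins


def misbound(nm_output: str, expected_from_executable: Set[str]) -> List[str]:
    """Return expected backend symbols that bind to anything but the executable."""
    origins = symbols_by_origin(nm_output)
    from_executable = set(origins.get("executable", []))
    seen = {name for names in origins.values() for name in names}
    return sorted((expected_from_executable & seen) - from_executable)


# The four lines quoted in the message above.
SAMPLE = """\
(undefined) external _get_call_result_type (from executable)
(undefined) external _getmissingattr (from executable)
(undefined) external _hash_create (from libSystem)
(undefined) external _hash_search (from libSystem)
"""

if __name__ == "__main__":
    print(misbound(SAMPLE, {"_hash_create", "_hash_search", "_getmissingattr"}))
    # prints ['_hash_create', '_hash_search']
```

In practice the input would be the captured output of nm -m on pltcl.so; the list of symbols expected to come from the backend has to be supplied by whoever runs the check.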
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> The difference is that I use CC=gcc-11. I have changed to CC=cc, then it
> works (nm output shows "from executable"). So it's gcc that gets thrown
> off by the -lc.

Hah, that makes sense. So does changing the option order help?

regards, tom lane

On 13.06.22 18:01, Tom Lane wrote:
> Having said that, I wonder whether the position of the -bundle_loader
> switch in the command line is relevant to which way the hash_search
> reference is resolved. Seems like we could put it in front of the
> various -l options if that'd help.

Switching the order of -bundle_loader and -lc did not help.

On Tue, Jun 14, 2022 at 8:21 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
> The difference is that I use CC=gcc-11. I have changed to CC=cc, then it
> works (nm output shows "from executable"). So it's gcc that gets thrown
> off by the -lc.

Hrmph, I changed my CC to "ccache gcc-mp-11" (what MacPorts calls GCC 11), and I still can't reproduce the problem. I still get "(from executable)". In your original quote you showed "gcc", not "gcc-11", which (assuming it is found as /usr/bin/gcc) is just a little binary that redirects to clang... trying that, this time without ccache in the mix... and still no cigar. So something is different about GCC 11 from homebrew, or the linker invocation it produces under the covers, or the linker it's using?

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> Switching the order of -bundle_loader and -lc did not help.

Meh. Well, it was worth a try.

I'd be okay with just dropping the -lc from pl/tcl/Makefile and seeing what the buildfarm says. The fact that we needed it in 1998 doesn't mean that we still need it on supported versions of Tcl; nor was it ever anything but a hack for us to be overriding what TCL_LIBS says.

As a quick check, I tried it on prairiedog's host (which has the oldest Tcl installation I still have in captivity), and it seemed fine.

regards, tom lane

On 13.06.22 23:32, Thomas Munro wrote:
> Hrmph, I changed my CC to "ccache gcc-mp-11" (what MacPorts calls GCC
> 11), and I still can't reproduce the problem. I still get "(from
> executable)". In your original quote you showed "gcc", not "gcc-11",
> which (assuming it is found as /usr/bin/gcc) is just a little binary
> that redirects to clang... trying that, this time without ccache in
> the mix... and still no cigar. So something is different about GCC 11
> from homebrew, or the linker invocation it produces under the covers,
> or the linker it's using?

The original quote said "gcc" but that was just me attempting to simplify. I have now also figured out that it works with gcc-10 but not with gcc-11 and gcc-12. For example, below are the underlying linker invocations from gcc-10 and gcc-11. Note that some of the options are ordered quite differently. I don't know what all of that means yet, but it surely points to something in gcc or its packaging being the cause.

However, I think ultimately the use of -lc is an error and we should get rid of it. This episode shows that it's very fragile in any case.

"/usr/local/Cellar/gcc@10/10.3.0/libexec/gcc/x86_64-apple-darwin20/10.3.0/collect2" -dynamic -arch x86_64 -bundle -bundle_loader ../../../src/backend/postgres -macosx_version_min 11.4.0 -multiply_defined suppress -syslibroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.3.sdk -weak_reference_mismatches non-weak -o pltcl.so -L../../../src/port -L../../../src/common -L/usr/local/lib -L/usr/local/opt/openldap/lib "-L/usr/local/opt/openssl@1.1/lib" -L/usr/local/opt/readline/lib -L/usr/local/opt/krb5/lib -L/usr/local/opt/icu4c/lib -L/usr/local/opt/tcl-tk/lib -L/usr/local/Cellar/libxml2/2.9.14/lib -L/usr/local/Cellar/lz4/1.9.3/lib -L/usr/local/Cellar/zstd/1.5.2/lib -L/usr/local/Cellar/tcl-tk/8.6.12_1/lib "-L/usr/local/Cellar/gcc@10/10.3.0/lib/gcc/10/gcc/x86_64-apple-darwin20/10.3.0" "-L/usr/local/Cellar/gcc@10/10.3.0/lib/gcc/10/gcc/x86_64-apple-darwin20/10.3.0/../../.." pltcl.o -dead_strip_dylibs -ltcl8.6 -lz -framework CoreFoundation -lc -lSystem -lgcc_ext.10.5 -lgcc -lSystem -no_compact_unwind -idsym

/usr/local/Cellar/gcc/11.3.0_1/bin/../libexec/gcc/x86_64-apple-darwin21/11/collect2 -dynamic -arch x86_64 -syslibroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.3.sdk -macosx_version_min 12.4.0 -o pltcl.so -L../../../src/port -L../../../src/common -L/usr/local/lib -L/usr/local/opt/openldap/lib "-L/usr/local/opt/openssl@1.1/lib" -L/usr/local/opt/readline/lib -L/usr/local/opt/krb5/lib -L/usr/local/opt/icu4c/lib -L/usr/local/opt/tcl-tk/lib -L/usr/local/Cellar/libxml2/2.9.14/lib -L/usr/local/Cellar/lz4/1.9.3/lib -L/usr/local/Cellar/zstd/1.5.2/lib -L/usr/local/Cellar/tcl-tk/8.6.12_1/lib -L/usr/local/Cellar/gcc/11.3.0_1/bin/../lib/gcc/11/gcc/x86_64-apple-darwin21/11 -L/usr/local/Cellar/gcc/11.3.0_1/bin/../lib/gcc/11/gcc -L/usr/local/Cellar/gcc/11.3.0_1/bin/../lib/gcc/11/gcc/x86_64-apple-darwin21/11/../../.. pltcl.o -dead_strip_dylibs -ltcl8.6 -lz -lc -bundle_loader ../../../src/backend/postgres -bundle -framework CoreFoundation -multiply_defined suppress -lemutls_w -lgcc -lSystem -no_compact_unwind -idsym

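The two collect2 command lines are hard to compare by eye. As a rough aid, here is a sketch that tokenizes a command line and reports the relative order of a few flags. The helper names flag_positions and relative_order are hypothetical, and the GCC10 and GCC11 strings are abbreviated from the invocations quoted above (the -L paths and several other options are dropped, and only the relative order of the flags being compared is preserved). It only makes the ordering difference visible; it does not establish that the ordering causes the mis-binding, and swapping -bundle_loader and -lc on the compiler command line was already reported not to help.

```python
import shlex
from typing import Dict, List, Optional


def flag_positions(command: str, flags: List[str]) -> Dict[str, Optional[int]]:
    """Map each flag to its first index in the tokenized command, or None."""
    tokens = shlex.split(command)
    return {flag: tokens.index(flag) if flag in tokens else None for flag in flags}


def relative_order(command: str, flags: List[str]) -> List[str]:
    """Return the flags that are present, sorted by where they first appear."""
    positions = flag_positions(command, flags)
    present = [flag for flag in flags if positions[flag] is not None]
    return sorted(present, key=lambda flag: positions[flag])


# Abbreviated from the gcc-10 and gcc-11 linker invocations in this message.
GCC10 = (
    "collect2 -dynamic -arch x86_64 -bundle -bundle_loader "
    "../../../src/backend/postgres -o pltcl.so pltcl.o -dead_strip_dylibs "
    "-ltcl8.6 -lz -framework CoreFoundation -lc -lSystem -lgcc_ext.10.5 "
    "-lgcc -lSystem"
)
GCC11 = (
    "collect2 -dynamic -arch x86_64 -o pltcl.so pltcl.o -dead_strip_dylibs "
    "-ltcl8.6 -lz -lc -bundle_loader ../../../src/backend/postgres -bundle "
    "-framework CoreFoundation -multiply_defined suppress -lemutls_w -lgcc "
    "-lSystem"
)

if __name__ == "__main__":
    interesting = ["-lc", "-bundle_loader", "-lSystem"]
    print("gcc-10:", relative_order(GCC10, interesting))
    print("gcc-11:", relative_order(GCC11, interesting))
```

With these inputs, gcc-10 places -bundle_loader ahead of -lc, while gcc-11 places -lc first; both put -lSystem last.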
On 14.06.22 05:05, Tom Lane wrote:
> I'd be okay with just dropping the -lc from pl/tcl/Makefile and seeing
> what the buildfarm says. The fact that we needed it in 1998 doesn't
> mean that we still need it on supported versions of Tcl; nor was it
> ever anything but a hack for us to be overriding what TCL_LIBS says.

Ok, I propose to proceed with the attached patch (with a bit more explanation added) for the master branch (for now) and see how it goes.

On 20.06.22 12:36, Peter Eisentraut wrote:
> On 14.06.22 05:05, Tom Lane wrote:
>> I'd be okay with just dropping the -lc from pl/tcl/Makefile and seeing
>> what the buildfarm says. The fact that we needed it in 1998 doesn't
>> mean that we still need it on supported versions of Tcl; nor was it
>> ever anything but a hack for us to be overriding what TCL_LIBS says.
>
> Ok, I propose to proceed with the attached patch (with a bit more
> explanation added) for the master branch (for now) and see how it goes.

done