Thread: psql crash with custom build on RedHat 7

psql crash with custom build on RedHat 7

From
Dominique Devienne
Date:
Hi. We've recently upgraded from libpq 15.2 to 16.1.
We custom build postgresql using the instructions and GCC 9.1 (from RH7's dts9).
We used the same process for building 15.2 and 16.1.
But somehow psql crashes on any backslash command, while 15.2 works fine.
I've included the small backtrace below.
I've used \conninfo, but \dn crashes just the same.
Regular SQL, OTOH, works fine (I tried it before sending this email), so the crash is specific to backslash commands.
At this point, we're not sure what's going on.
I've tried against a 14.8 server (as shown below), but also a 12.5 one, same results.
So it seems related to the client side and how it was compiled, not the server side.

OTOH, 16.1 (custom) built on Windows, or on RH8 with GCC 12, works fine.

Would anyone have a clue why 16.1 on RH7 would fail as shown below?
Were there any specific changes between 15.2 and 16.1 that could explain this behavior?

Another data point: our own apps built using our custom-built libpq on RH7 (the same one used by psql, see ldd below) pass all their unit tests, with no crashes.

Thanks for any help, clues, anything that might help. Thanks, --DD

[ddevienne@marsu SharedComponents]$ ldd .../postgresql/16.1/Linux_x64_2.17_gcc91/bin/psql
        linux-vdso.so.1 =>  (0x00007ffcf9cb1000)
        libpq.so.5 => .../postgresql/16.1/Linux_x64_2.17_gcc91//lib/libpq.so.5 (0x00007f278e8e6000)
        libreadline.so.6 => /lib64/libreadline.so.6 (0x00007f278e4d4000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f278e2b8000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f278e0b0000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f278ddae000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f278d9e1000)
        libssl.so.10 => /lib64/libssl.so.10 (0x00007f278d770000)
        libcrypto.so.10 => /lib64/libcrypto.so.10 (0x00007f278d30f000)
        libtinfo.so.5 => /lib64/libtinfo.so.5 (0x00007f278d0e5000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f278e71a000)
        libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00007f278ce98000)
        libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007f278cbb0000)
        libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007f278c9ac000)
        libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00007f278c779000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f278c575000)
        libz.so.1 => /lib64/libz.so.1 (0x00007f278c35f000)
        libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007f278c151000)
        libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00007f278bf4d000)
        libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f278bd34000)
        libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f278bb0d000)
        libpcre.so.1 => /lib64/libpcre.so.1 (0x00007f278b8ab000)
[ddevienne@marsu SharedComponents]$ cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 7.5 (Maipo)

[ddevienne@marsu SharedComponents]$ gdb .../postgresql/16.1/Linux_x64_2.17_gcc91/bin/psql
GNU gdb (GDB) Red Hat Enterprise Linux 8.3-3.el7
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from .../postgresql/16.1/Linux_x64_2.17_gcc91/bin/psql...
(No debugging symbols found in .../postgresql/16.1/Linux_x64_2.17_gcc91/bin/psql)
(gdb) run postgresql://ddevienne@db/migrated
Starting program: .../postgresql/16.1/Linux_x64_2.17_gcc91/bin/psql postgresql://ddevienne@db/migrated
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Password for user ddevienne:
psql (16.1, server 14.8)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, compression: off)
Type "help" for help.

migrated=> \conninfo

Program received signal SIGSEGV, Segmentation fault.
0x00000000004232b8 in slash_yylex ()
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-18.el7.x86_64 libcom_err-1.42.9-11.el7.x86_64 libselinux-2.5-12.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64 readline-6.2-10.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00000000004232b8 in slash_yylex ()
#1  0x000000000042456b in psql_scan_slash_command ()
#2  0x000000000040d56f in HandleSlashCmds ()
#3  0x0000000000421d63 in MainLoop ()
#4  0x0000000000405c5c in main ()
(gdb)

Re: psql crash with custom build on RedHat 7

From
Thomas Munro
Date:
On Wed, Dec 20, 2023 at 1:39 AM Dominique Devienne <ddevienne@gmail.com> wrote:
> Program received signal SIGSEGV, Segmentation fault.
> 0x00000000004232b8 in slash_yylex ()

I think this might have to do with flex changing.  Does it help if you
"make maintainer-clean"?



Re: psql crash with custom build on RedHat 7

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Dec 20, 2023 at 1:39 AM Dominique Devienne <ddevienne@gmail.com> wrote:
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x00000000004232b8 in slash_yylex ()

> I think this might have to do with flex changing.  Does it help if you
> "make maintainer-clean"?

If that doesn't fix it, please build with --enable-debug so that you
can get a more detailed stack trace.

            regards, tom lane



Re: psql crash with custom build on RedHat 7

From
Dominique Devienne
Date:
On Tue, Dec 19, 2023 at 2:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Dec 20, 2023 at 1:39 AM Dominique Devienne <ddevienne@gmail.com> wrote:
> > Program received signal SIGSEGV, Segmentation fault.
> > 0x00000000004232b8 in slash_yylex ()
>
> I think this might have to do with flex changing.  Does it help if you
> "make maintainer-clean"?

My colleague who did the custom build double-checked the flex/bison requirements,
and the version of the packages on the RH7 machine he built on, and they check out (see below).

He also tells me he builds debug and release versions off different workspaces/checkouts,
thus there are no remnants of previous builds, assuming that's what `make maintainer-clean` is for.

Thanks, --DD

-------- From the Build Mgr -------------
Here https://www.postgresql.org/docs/current/install-requirements.html I read:
> Flex and Bison are needed to build from a Git checkout, or if you changed the actual scanner and parser definition files.
> If you need them, be sure to get Flex 2.5.35 or later and Bison 2.3 or later.

On the cf-re7-toolkits (RH7) machine where I built postgresql 16.1, these system packages are installed:

$ rpm -qa | grep flex
flex-2.5.37-3.el7.x86_64
$ rpm -qa | grep bison
bison-3.0.4-1.el7.x86_64

Both meet the minimum versions quoted above, so they look good.

Re: psql crash with custom build on RedHat 7

From
Thomas Munro
Date:
On Wed, Dec 20, 2023 at 4:41 AM Dominique Devienne <ddevienne@gmail.com> wrote:
> On Tue, Dec 19, 2023 at 2:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> On Wed, Dec 20, 2023 at 1:39 AM Dominique Devienne <ddevienne@gmail.com> wrote:
>> > Program received signal SIGSEGV, Segmentation fault.
>> > 0x00000000004232b8 in slash_yylex ()
>>
>> I think this might have to do with flex changing.  Does it help if you
>> "make maintainer-clean"?
>
> My colleague who did the custom build double-checked the flex/bison requirements,
> and the version of the packages on the RH7 machine he built on, and they check out (see below).
>
> He also tells me he builds debug and release versions off different workspaces/checkouts,
> thus there are no remnants of previous builds, assuming that's what `make maintainer-clean` is for.

OK but be warned that if you're using tarballs, we shipped lexer
remnants in the tree (until
https://github.com/postgres/postgres/commit/721856ff, an interesting
commit to read).  The slash lexer is a kind of extension that (IIRC)
shares the same PsqlScanState (opaque pointer to private lexer state),
but if these two things are compiled to C by different flex versions,
they may contain non-identical 'struct yyguts_t' (and even if the
structs were identical, what the code does with them might still be
incompatible, but I guess the struct itself would be a good first
thing to look at along with the versions mentioned near the top of the
.c):

src/fe_utils/psqlscan.l -> psqlscan.c
src/bin/psql/psqlscanslash.l -> psqlscanslash.c

The usual "clean" doesn't remove those .c files in PG < 17, which
means that if your pipeline involves tarballs but you finished up
regenerating one of the files, or some other sequence involving
different flex versions, you could get that.  I've seen it myself on a
few systems, a decade ago when I guess flex rolled out an incompatible
change (maybe contemporaneous with RHEL7) and flex was upgraded
underneath my feet.  I remember that "maintainer-clean" (or maybe I'm
misremembering and it was "distclean") fixed it.
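
[Editor's sketch, not part of the original message.] One way to check the "versions mentioned near the top of the .c" is to read the YY_FLEX_MAJOR_VERSION, YY_FLEX_MINOR_VERSION and YY_FLEX_SUBMINOR_VERSION macros that flex writes into the scanners it generates, and compare them across the two files listed above. The helper names flex_version and compare are hypothetical, and the default paths assume the script is run from the top of an unpacked source tree where both generated .c files exist. Matching versions do not prove the two scanners are compatible; a mismatch is only a strong hint that one file was regenerated.

```python
import re
import sys
from pathlib import Path

# Flex records its own version in three macros near the top of the
# C file it generates; this pattern picks up all three.
_VERSION_RE = re.compile(
    r"^#define\s+YY_FLEX_(MAJOR|MINOR|SUBMINOR)_VERSION\s+(\d+)", re.MULTILINE
)


def flex_version(c_source):
    """Return (major, minor, subminor) found in generated C text, or None."""
    parts = {name: int(value) for name, value in _VERSION_RE.findall(c_source)}
    try:
        return (parts["MAJOR"], parts["MINOR"], parts["SUBMINOR"])
    except KeyError:
        # At least one macro is missing: probably not a flex-generated file.
        return None


def compare(paths):
    """Map each existing path to its flex version; missing files are skipped."""
    return {
        p: flex_version(Path(p).read_text(errors="replace"))
        for p in paths
        if Path(p).is_file()
    }


if __name__ == "__main__":
    # Assumed defaults: the two generated scanners named in the message,
    # relative to the top of the source tree.
    default = ["src/fe_utils/psqlscan.c", "src/bin/psql/psqlscanslash.c"]
    found = compare(sys.argv[1:] or default)
    for path, version in found.items():
        print(path, version)
    if len(set(found.values())) > 1:
        print("WARNING: scanners were generated by different flex versions")
```

Run it from the source directory before building, and again after the build, to see whether either scanner was regenerated along the way.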



Re: psql crash with custom build on RedHat 7

From
Dominique Devienne
Date:
On Tue, Dec 19, 2023 at 7:58 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Dec 20, 2023 at 4:41 AM Dominique Devienne <ddevienne@gmail.com> wrote:
> > On Tue, Dec 19, 2023 at 2:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >> On Wed, Dec 20, 2023 at 1:39 AM Dominique Devienne <ddevienne@gmail.com> wrote:
> >> > Program received signal SIGSEGV, Segmentation fault.
> >> > 0x00000000004232b8 in slash_yylex ()
>
> OK but be warned that if you're using tarballs, we shipped lexer remnants in the tree
> [...] which means that if your pipeline involves tarballs but you finished up
> regenerating one of the files, or some other sequence involving
> different flex versions, you could get that.

Thanks. Very insightful. We got tarballs from https://www.postgresql.org/ftp/source/v16.1/,
thus it could have been it indeed. We've rebuilt from scratch again, and now things are back to normal.
I'm not 100% sure what fixed it exactly, since I only get second-hand info, and thus don't know for sure whether
`make maintainer-clean` was used or not. Could also have been a glitch in our SCM, we also forced a
resync on the workspace with 3rd parties (NFS mounted).

In any case, this is now resolved, albeit in a muddy way I'm afraid...

Really appreciate the help. Thanks, --DD