Re: Server crash on RHEL 9/s390x platform against PG16 - Mailing list pgsql-hackers

From Suraj Kharage
Subject Re: Server crash on RHEL 9/s390x platform against PG16
Date
Msg-id CAF1DzPUV9zhJNXr_npGrZCi3d+__Ob4F1bZx0g4k80zK5_3muA@mail.gmail.com
In response to Re: Server crash on RHEL 9/s390x platform against PG16  (Suraj Kharage <suraj.kharage@enterprisedb.com>)
Responses Re: Server crash on RHEL 9/s390x platform against PG16
Re: Server crash on RHEL 9/s390x platform against PG16
List pgsql-hackers
This looks like a JIT issue. With JIT disabled, the query from my report runs successfully.

postgres=# set jit to off;
SET
postgres=# SELECT * FROM rm32044_t1 LEFT JOIN rm32044_t2 ON rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN rm32044_t4 ON rm32044_t3.pkey = rm32044_t4.pkey order by rm32044_t1.pkey,label,hidden;
 pkey | val  | pkey |  label  | hidden | pkey | val | pkey 
------+------+------+---------+--------+------+-----+------
    1 | row1 |    1 | hidden  | t      |    1 |   1 |     
    1 | row1 |    1 | hidden  | t      |    2 |   1 |     
    2 | row2 |    2 | visible | f      |    1 |   1 |     
    2 | row2 |    2 | visible | f      |    2 |   1 |     
(4 rows)
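
To narrow down which part of JIT is involved, the individual JIT features could be switched off one at a time instead of disabling JIT entirely. The following is only a sketch and was not run as part of this report. It assumes the standard PG16 settings jit_tuple_deforming, jit_inline_above_cost and jit_optimize_above_cost, and each step is best run in a fresh session, because a crash resets the connection and the settings with it.

```sql
-- Step 1 (fresh session): keep JIT on, but do not JIT-compile tuple deforming.
SET jit = on;
SET jit_tuple_deforming = off;
SELECT * FROM rm32044_t1 LEFT JOIN rm32044_t2 ON rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN rm32044_t4 ON rm32044_t3.pkey = rm32044_t4.pkey order by rm32044_t1.pkey,label,hidden;

-- Step 2 (fresh session): keep JIT and deforming on, but disable
-- inlining and the expensive optimization passes (-1 disables each).
SET jit = on;
SET jit_inline_above_cost = -1;
SET jit_optimize_above_cost = -1;
SELECT * FROM rm32044_t1 LEFT JOIN rm32044_t2 ON rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN rm32044_t4 ON rm32044_t3.pkey = rm32044_t4.pkey order by rm32044_t1.pkey,label,hidden;
```

If the crash disappears only in step 1, the JIT-compiled deforming code would be the first place to look. If it disappears only in step 2, the inlining or optimization passes would be.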
Any idea on this?


On Mon, Sep 18, 2023 at 11:20 AM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:
A few more details on this:

(gdb) p val
$1 = 0
(gdb) p i
$2 = 3
(gdb) f 3
#3  0x0000000001a1ef70 in ExecCopySlotMinimalTuple (slot=0x202e4f8) at ../../../../src/include/executor/tuptable.h:472
472 return slot->tts_ops->copy_minimal_tuple(slot);
(gdb) p *slot
$3 = {type = T_TupleTableSlot, tts_flags = 16, tts_nvalid = 8, tts_ops = 0x1b6dcc8 <TTSOpsVirtual>, tts_tupleDescriptor = 0x202e0e8, tts_values = 0x202e540, tts_isnull = 0x202e580, tts_mcxt = 0x1f54550, tts_tid = {ip_blkid = {bi_hi = 65535,
      bi_lo = 65535}, ip_posid = 0}, tts_tableOid = 0}
(gdb) p *slot->tts_tupleDescriptor
$2 = {natts = 8, tdtypeid = 2249, tdtypmod = -1, tdrefcount = -1, constr = 0x0, attrs = 0x202cd28}

(gdb) p slot.tts_values[3]
$4 = 0
(gdb) p slot.tts_values[2]
$5 = 1
(gdb) p slot.tts_values[1]
$6 = 34027556


According to the result slot, tts_values[3] (column label) is 0.
I'm testing this in a Docker container and am having some issues with gdb, so I could not debug it further.

Here is the EXPLAIN plan:

postgres=# explain (verbose, costs off) SELECT * FROM rm32044_t1 LEFT JOIN rm32044_t2 ON rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN rm32044_t4 ON rm32044_t3.pkey = rm32044_t4.pkey order by rm32044_t1.pkey,label,hidden;
                                                                       QUERY PLAN                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Incremental Sort
   Output: rm32044_t1.pkey, rm32044_t1.val, rm32044_t2.pkey, rm32044_t2.label, rm32044_t2.hidden, rm32044_t3.pkey, rm32044_t3.val, rm32044_t4.pkey
   Sort Key: rm32044_t1.pkey, rm32044_t2.label, rm32044_t2.hidden
   Presorted Key: rm32044_t1.pkey
   ->  Merge Left Join
         Output: rm32044_t1.pkey, rm32044_t1.val, rm32044_t2.pkey, rm32044_t2.label, rm32044_t2.hidden, rm32044_t3.pkey, rm32044_t3.val, rm32044_t4.pkey
         Merge Cond: (rm32044_t1.pkey = rm32044_t2.pkey)
         ->  Sort
               Output: rm32044_t3.pkey, rm32044_t3.val, rm32044_t4.pkey, rm32044_t1.pkey, rm32044_t1.val
               Sort Key: rm32044_t1.pkey
               ->  Nested Loop
                     Output: rm32044_t3.pkey, rm32044_t3.val, rm32044_t4.pkey, rm32044_t1.pkey, rm32044_t1.val
                     ->  Merge Left Join
                           Output: rm32044_t3.pkey, rm32044_t3.val, rm32044_t4.pkey
                           Merge Cond: (rm32044_t3.pkey = rm32044_t4.pkey)
                           ->  Sort
                                 Output: rm32044_t3.pkey, rm32044_t3.val
                                 Sort Key: rm32044_t3.pkey
                                 ->  Seq Scan on public.rm32044_t3
                                       Output: rm32044_t3.pkey, rm32044_t3.val
                           ->  Sort
                                 Output: rm32044_t4.pkey
                                 Sort Key: rm32044_t4.pkey
                                 ->  Seq Scan on public.rm32044_t4
                                       Output: rm32044_t4.pkey
                     ->  Materialize
                           Output: rm32044_t1.pkey, rm32044_t1.val
                           ->  Seq Scan on public.rm32044_t1
                                 Output: rm32044_t1.pkey, rm32044_t1.val
         ->  Sort
               Output: rm32044_t2.pkey, rm32044_t2.label, rm32044_t2.hidden
               Sort Key: rm32044_t2.pkey
               ->  Seq Scan on public.rm32044_t2
                     Output: rm32044_t2.pkey, rm32044_t2.label, rm32044_t2.hidden
(34 rows)


It seems that while building the inner slot for the merge join, the value for attnum 1 is not fetched correctly.

On Tue, Sep 12, 2023 at 3:27 PM Suraj Kharage <suraj.kharage@enterprisedb.com> wrote:
Hi,

I found a server crash on the RHEL 9/s390x platform with the test case below.

Machine details:
[edb@9428da9d2137 postgres]$ cat /etc/redhat-release
AlmaLinux release 9.2 (Turquoise Kodkod)
[edb@9428da9d2137 postgres]$ lscpu
Architecture:           s390x
  CPU op-mode(s):       32-bit, 64-bit
  Address sizes:        39 bits physical, 48 bits virtual
  Byte Order:           Big Endian

Configure command:
./configure --prefix=/home/edb/postgres/ --with-lz4 --with-zstd --with-llvm --with-perl --with-python --with-tcl --with-openssl --enable-nls --with-libxml --with-libxslt --with-systemd --with-libcurl --without-icu --enable-debug --enable-cassert --with-pgport=5414


Test case:
CREATE TABLE rm32044_t1
(
    pkey   integer,
    val  text
);
CREATE TABLE rm32044_t2
(
    pkey   integer,
    label  text,
    hidden boolean
);
CREATE TABLE rm32044_t3
(
        pkey integer,
        val integer
);
CREATE TABLE rm32044_t4
(
        pkey integer
);
insert into rm32044_t1 values ( 1 , 'row1');
insert into rm32044_t1 values ( 2 , 'row2');
insert into rm32044_t2 values ( 1 , 'hidden', true);
insert into rm32044_t2 values ( 2 , 'visible', false);
insert into rm32044_t3 values (1 , 1);
insert into rm32044_t3 values (2 , 1);
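
Whether JIT is used for the query depends on the planner's cost estimates, so the plan may or may not be JIT-compiled on another machine. For anyone trying to reproduce this elsewhere, the sketch below forces JIT for every query in the session. These settings are an assumption added for illustration and were not part of the original session.

```sql
-- Force JIT compilation, inlining and optimization regardless of plan cost.
SET jit = on;
SET jit_above_cost = 0;
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;
```

Running the failing query after these settings should exercise the JIT code paths even on small tables.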

postgres=# SELECT * FROM rm32044_t1 LEFT JOIN rm32044_t2 ON rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN rm32044_t4 ON rm32044_t3.pkey = rm32044_t4.pkey order by rm32044_t1.pkey,label,hidden;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.

backtrace:
[edb@9428da9d2137 postgres]$ gdb bin/postgres data/qemu_postgres_20230911-140628_65620.core
Core was generated by `postgres: edb postgres [local] SELECT  '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000010a8366 in heap_compute_data_size (tupleDesc=tupleDesc@entry=0x1ba3d10, values=values@entry=0x1ba4168, isnull=isnull@entry=0x1ba41a8) at heaptuple.c:227
227 VARATT_CAN_MAKE_SHORT(DatumGetPointer(val)))
[Current thread is 1 (LWP 65597)]
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.s390x libcap-2.48-8.el9.s390x libedit-3.1-37.20210216cvs.el9.s390x libffi-3.4.2-7.el9.s390x libgcc-11.3.1-4.3.el9.alma.s390x libgcrypt-1.10.0-10.el9_2.s390x libgpg-error-1.42-5.el9.s390x libstdc++-11.3.1-4.3.el9.alma.s390x libxml2-2.9.13-3.el9_2.1.s390x libzstd-1.5.1-2.el9.s390x llvm-libs-15.0.7-1.el9.s390x lz4-libs-1.9.3-5.el9.s390x ncurses-libs-6.2-8.20210508.el9.s390x openssl-libs-3.0.7-17.el9_2.s390x systemd-libs-252-14.el9_2.3.s390x xz-libs-5.2.5-8.el9_0.s390x
(gdb) bt
#0  0x00000000010a8366 in heap_compute_data_size (tupleDesc=tupleDesc@entry=0x1ba3d10, values=values@entry=0x1ba4168, isnull=isnull@entry=0x1ba41a8) at heaptuple.c:227
#1  0x00000000010a9bb0 in heap_form_minimal_tuple (tupleDescriptor=0x1ba3d10, values=0x1ba4168, isnull=0x1ba41a8) at heaptuple.c:1484
#2  0x00000000016553fa in ExecCopySlotMinimalTuple (slot=<optimized out>) at ../../../../src/include/executor/tuptable.h:472
#3  tuplesort_puttupleslot (state=state@entry=0x1be4d18, slot=slot@entry=0x1ba4120) at tuplesortvariants.c:610
#4  0x00000000012dc0e0 in ExecIncrementalSort (pstate=0x1acb4d8) at nodeIncrementalSort.c:716
#5  0x00000000012b32c6 in ExecProcNode (node=0x1acb4d8) at ../../../src/include/executor/executor.h:273
#6  ExecutePlan (execute_once=<optimized out>, dest=0x1ade698, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1acb4d8, estate=0x1acb258) at execMain.c:1670
#7  standard_ExecutorRun (queryDesc=0x19ad338, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:365
#8  0x00000000014a6ae2 in PortalRunSelect (portal=portal@entry=0x1a63558, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x1ade698) at pquery.c:924
#9  0x00000000014a84e0 in PortalRun (portal=portal@entry=0x1a63558, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x1ade698, altdest=0x1ade698, qc=0x40007ff7b0) at pquery.c:768
#10 0x00000000014a3c1c in exec_simple_query (
    query_string=0x19ea0e8 "SELECT * FROM rm32044_t1 LEFT JOIN rm32044_t2 ON rm32044_t1.pkey = rm32044_t2.pkey, rm32044_t3 LEFT JOIN rm32044_t4 ON rm32044_t3.pkey = rm32044_t4.pkey order by rm32044_t1.pkey,label,hidden;") at postgres.c:1274
#11 0x00000000014a57aa in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4637
#12 0x00000000013fdaf6 in BackendRun (port=0x1a132c0, port=0x1a132c0) at postmaster.c:4464
#13 BackendStartup (port=0x1a132c0) at postmaster.c:4192
#14 ServerLoop () at postmaster.c:1782
#15 0x00000000013fec34 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x19a59a0) at postmaster.c:1466
#16 0x0000000001096faa in main (argc=<optimized out>, argv=0x19a59a0) at main.c:198

(gdb) p val
$1 = 0

Does anybody have any idea about this?

--
--

Thanks & Regards, 
Suraj kharage, 



