Thread: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.

BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16754
Logged by:          Noriyoshi Shinoda
Email address:      noriyoshi.shinoda@hpe.com
PostgreSQL version: 13.1
Operating system:   Red Hat Enterprise Linux 7.8 (x86-64)
Description:

OS: 
Red Hat Enterprise Linux 7.8 (x86-64)

Version:
PostgreSQL 13.1
cfe-7.1.0
llvm-7.1.0
cmake-3.18.4

Configure:
 configure --prefix=$HOME/pg131 --with-llvm --enable-debug
--enable-cassert

$ pg_config
BINDIR = /home/postgres/pg131/bin
DOCDIR = /home/postgres/pg131/share/doc
HTMLDIR = /home/postgres/pg131/share/doc
INCLUDEDIR = /home/postgres/pg131/include
PKGINCLUDEDIR = /home/postgres/pg131/include
INCLUDEDIR-SERVER = /home/postgres/pg131/include/server
LIBDIR = /home/postgres/pg131/lib
PKGLIBDIR = /home/postgres/pg131/lib
LOCALEDIR = /home/postgres/pg131/share/locale
MANDIR = /home/postgres/pg131/share/man
SHAREDIR = /home/postgres/pg131/share
SYSCONFDIR = /home/postgres/pg131/etc
PGXS = /home/postgres/pg131/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE =  '--prefix=/home/postgres/pg131' '--with-llvm' '--enable-debug'
'--enable-cassert'
CC = gcc -std=gnu99
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -g -O2
CFLAGS_SL = -fPIC
LDFLAGS = -L/usr/local/lib -Wl,--as-needed
-Wl,-rpath,'/home/postgres/pg131/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lpthread -lz -lreadline -lrt -ldl -lm
VERSION = PostgreSQL 13.1

- Prepare
$ initdb --no-locale --encoding=utf8 data.131
$ pg_ctl -D data.131 start
$ psql
postgres=# SELECT version();
                                                 version
---------------------------------------------------------------------------------------------------------
 PostgreSQL 13.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-39), 64-bit
(1 row)
postgres=# CREATE USER demo PASSWORD 'demo';
CREATE ROLE
postgres=# \connect postgres demo
You are now connected to database "postgres" as user "demo".
postgres=> CREATE TABLE data1(c1 NUMERIC, c2 VARCHAR(10));
CREATE TABLE
postgres=> INSERT INTO data1 VALUES (generate_series(1, 1000000),
'data1');
INSERT 0 1000000
postgres=> ANALYZE data1;
ANALYZE
postgres=> \d+
                         List of relations
 Schema | Name  | Type  | Owner | Persistence | Size  | Description
--------+-------+-------+-------+-------------+-------+-------------
 public | data1 | table | demo  | permanent   | 42 MB |
(1 row)

- Operation
-- Normal case (No JIT/Cancel parallel worker)
--- Session#1
$ psql postgres demo
postgres=> SET force_parallel_mode = on;
SET
postgres=> SET jit = off;
SET
postgres=> EXPLAIN SELECT * FROM data1 d1, data1 d2;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Gather  (cost=1000.00..117383034312.00 rows=1000000000000 width=24)
   Workers Planned: 1
   Single Copy: true
   ->  Nested Loop  (cost=0.00..17383033312.00 rows=1000000000000
width=24)
         ->  Seq Scan on data1 d1  (cost=0.00..15406.00 rows=1000000
width=12)
         ->  Materialize  (cost=0.00..25289.00 rows=1000000 width=12)
               ->  Seq Scan on data1 d2  (cost=0.00..15406.00 rows=1000000
width=12)
(7 rows)
    
postgres=> EXPLAIN ANALYZE SELECT * FROM data1 d1, data1 d2;        <- Execute
pg_cancel_backend(130457) on Session#2
ERROR:  canceling statement due to user request
CONTEXT:  parallel worker

--- Session#2
$ psql
postgres=# SELECT pid, leader_pid FROM pg_stat_activity WHERE
usename='demo';
  pid   | leader_pid
--------+------------
 130413 |
 130457 |     130413
(2 rows)

postgres=# SELECT pg_cancel_backend(130457);
 pg_cancel_backend
-------------------
 t
(1 row)


-- NG case (With JIT / Cancel parallel worker)
--- Session#1

$ psql postgres demo    
postgres=> SET force_parallel_mode = on;
SET
postgres=> SET jit = on;
SET
postgres=> EXPLAIN SELECT * FROM data1 d1, data1 d2;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Gather  (cost=1000.00..117383034312.00 rows=1000000000000 width=24)
   Workers Planned: 1
   Single Copy: true
   ->  Nested Loop  (cost=0.00..17383033312.00 rows=1000000000000
width=24)
         ->  Seq Scan on data1 d1  (cost=0.00..15406.00 rows=1000000
width=12)
         ->  Materialize  (cost=0.00..25289.00 rows=1000000 width=12)
               ->  Seq Scan on data1 d2  (cost=0.00..15406.00 rows=1000000
width=12)
 JIT:
   Functions: 3
   Options: Inlining true, Optimization true, Expressions true, Deforming
true
(10 rows)

postgres=> EXPLAIN ANALYZE SELECT * FROM data1 d1, data1 d2;    <- Execute
pg_cancel_backend(130342);
WARNING:  terminating connection because of crash of another server
process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
ERROR:  canceling statement due to user request
CONTEXT:  parallel worker
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>

--- Session#2
postgres=# SELECT pid, leader_pid FROM pg_stat_activity WHERE
usename='demo';
  pid   | leader_pid
--------+------------
 130322 |
 130342 |     130322
(2 rows)

postgres=# SELECT pg_cancel_backend(130342);
 pg_cancel_backend
-------------------
 t
(1 row)

- log file
2020-12-01 14:26:02.337 JST [130270] LOG:  background worker "parallel
worker" (PID 130342) was terminated by signal 11: Segmentation fault
2020-12-01 14:26:02.337 JST [130270] DETAIL:  Failed process was running:
EXPLAIN ANALYZE SELECT * FROM data1 d1, data1 d2;
2020-12-01 14:26:02.337 JST [130270] LOG:  terminating any other active
server processes
2020-12-01 14:26:02.337 JST [130322] WARNING:  terminating connection
because of crash of another server process
2020-12-01 14:26:02.337 JST [130322] DETAIL:  The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2020-12-01 14:26:02.337 JST [130322] HINT:  In a moment you should be able
to reconnect to the database and repeat your command.
2020-12-01 14:26:02.338 JST [130276] WARNING:  terminating connection
because of crash of another server process
2020-12-01 14:26:02.338 JST [130276] DETAIL:  The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2020-12-01 14:26:02.338 JST [130276] HINT:  In a moment you should be able
to reconnect to the database and repeat your command.
2020-12-01 14:26:02.340 JST [130345] FATAL:  the database system is in
recovery mode
2020-12-01 14:26:02.343 JST [130319] WARNING:  terminating connection
because of crash of another server process
2020-12-01 14:26:02.343 JST [130319] DETAIL:  The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2020-12-01 14:26:02.343 JST [130319] HINT:  In a moment you should be able
to reconnect to the database and repeat your command.
2020-12-01 14:26:02.343 JST [130270] LOG:  all server processes terminated;
reinitializing
2020-12-01 14:26:02.355 JST [130348] LOG:  database system was interrupted;
last known up at 2020-12-01 14:21:07 JST
2020-12-01 14:26:02.360 JST [130348] LOG:  database system was not properly
shut down; automatic recovery in progress
2020-12-01 14:26:02.360 JST [130348] LOG:  redo starts at 0/15C07D0
2020-12-01 14:26:02.767 JST [130348] LOG:  invalid record length at
0/5B10A90: wanted 24, got 0
2020-12-01 14:26:02.767 JST [130348] LOG:  redo done at 0/5B10A58
2020-12-01 14:26:02.821 JST [130270] LOG:  database system is ready to
accept connections


PG Bug reporting form <noreply@postgresql.org> writes:
> SELECT pg_cancel_backend(130342);
> ...
> 2020-12-01 14:26:02.337 JST [130270] LOG:  background worker "parallel
> worker" (PID 130342) was terminated by signal 11: Segmentation fault
> 2020-12-01 14:26:02.337 JST [130270] DETAIL:  Failed process was running:
> EXPLAIN ANALYZE SELECT * FROM data1 d1, data1 d2;

FWIW, I failed to reproduce any crash here, using RHEL8 (llvm-10.0.1).
I don't see any plausibly-related commits since 13.1, so it's not
something we recently fixed.  I wonder if this is an LLVM bug fixed
since the 7.1 release you have.

            regards, tom lane



RE: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.

From
"Shinoda, Noriyoshi (PN Japan FSIP)"
Date:
> I don't see any plausibly-related commits since 13.1, so it's not something we recently fixed.  I wonder if this is
anLLVM bug fixed since the 7.1 release you have. 

Thank you for your comment and test.
I tried again with an older version of LLVM below.

- PostgreSQL 13.1
- Red Hat Enterprise Linux 7.5
- LLVM 6.0
- cfe 6.0

I couldn't reproduce the problem int the above environment, so maybe it's a problem in my development environment.

Regards,
Noriyoshi Shinoda

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, December 2, 2020 5:59 AM
To: Shinoda, Noriyoshi (PN Japan FSIP) <noriyoshi.shinoda@hpe.com>
Cc: pgsql-bugs@lists.postgresql.org; Andres Freund <andres@anarazel.de>
Subject: Re: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.

PG Bug reporting form <noreply@postgresql.org> writes:
> SELECT pg_cancel_backend(130342);
> ...
> 2020-12-01 14:26:02.337 JST [130270] LOG:  background worker "parallel
> worker" (PID 130342) was terminated by signal 11: Segmentation fault
> 2020-12-01 14:26:02.337 JST [130270] DETAIL:  Failed process was running:
> EXPLAIN ANALYZE SELECT * FROM data1 d1, data1 d2;

FWIW, I failed to reproduce any crash here, using RHEL8 (llvm-10.0.1).
I don't see any plausibly-related commits since 13.1, so it's not something we recently fixed.  I wonder if this is an
LLVMbug fixed since the 7.1 release you have. 

            regards, tom lane



Hi,

On 2020-12-02 01:11:07 +0000, Shinoda, Noriyoshi (PN Japan FSIP) wrote:
> > I don't see any plausibly-related commits since 13.1, so it's not something we recently fixed.  I wonder if this is
anLLVM bug fixed since the 7.1 release you have.
 
> 
> Thank you for your comment and test.
> I tried again with an older version of LLVM below.
> 
> - PostgreSQL 13.1
> - Red Hat Enterprise Linux 7.5
> - LLVM 6.0
> - cfe 6.0

Could you try to reproduce the issue with the version you hit it
earlier? It'd be helpful to get a backtrace.


> I couldn't reproduce the problem int the above environment, so maybe it's a problem in my development environment.

Did you have any modifications in the tree?

Did you do a build, pulled some updates, and built again? I noticed
that you didn't pass --enable-depend to ./configure, so it's possible
that not all files got rebuilt that needed to.

Regards,

Andres



RE: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.

From
"Shinoda, Noriyoshi (PN Japan FSIP)"
Date:
Hi,

Thanks for the comment.
I newly cloned the latest version 13.1 source and then rebuilt it. The segmentation fault has recurred.I tried the
backtracefeature, but it didn't work. The function specification may not have been good. 
Segmentation faults do not always occur, but occasionally after running pg_cancel_backend several times.
Also, it only seems to happen if I canceled the parallel worker process.

Regards,
Noriyoshi Shinoda
-----Original Message-----
From: Andres Freund [mailto:andres@anarazel.de]
Sent: Friday, December 4, 2020 5:51 AM
To: Shinoda, Noriyoshi (PN Japan FSIP) <noriyoshi.shinoda@hpe.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.

Hi,

On 2020-12-02 01:11:07 +0000, Shinoda, Noriyoshi (PN Japan FSIP) wrote:
> > I don't see any plausibly-related commits since 13.1, so it's not something we recently fixed.  I wonder if this is
anLLVM bug fixed since the 7.1 release you have. 
>
> Thank you for your comment and test.
> I tried again with an older version of LLVM below.
>
> - PostgreSQL 13.1
> - Red Hat Enterprise Linux 7.5
> - LLVM 6.0
> - cfe 6.0

Could you try to reproduce the issue with the version you hit it earlier? It'd be helpful to get a backtrace.


> I couldn't reproduce the problem int the above environment, so maybe it's a problem in my development environment.

Did you have any modifications in the tree?

Did you do a build, pulled some updates, and built again? I noticed that you didn't pass --enable-depend to
./configure,so it's possible that not all files got rebuilt that needed to. 

Regards,

Andres

Attachment
Hi,

On 2020-12-07 08:58:18 +0000, Shinoda, Noriyoshi (PN Japan FSIP) wrote:
> Thanks for the comment.
> I newly cloned the latest version 13.1 source and then rebuilt it. The
> segmentation fault has recurred.I tried the backtrace feature, but it
> didn't work.

What exactly do you mean by that? You attached to the worker with a
debugger, and you didn't get a backtrace once it crashed? Same with a
core file?  Or were you hoping for the segfault to automatically
generate a backtrace?

The easist way would be to enable core files with 'ulimit -c unlimited'
in the shell you start postgres in, and to set the
'jit_debugging_support=1' option. Then, once the crash happend, you
should be able to find a core file.

You then can inspect that core file with
gdb /path/to/postgres -core /path/to/core

and execute 'bt' inside.


> The function specification may not have been good.

Which function's specification?


> Segmentation faults do not always occur, but occasionally after running pg_cancel_backend several times.
> Also, it only seems to happen if I canceled the parallel worker process.

I tried this a couple hundred times without success.


Greetings,

Andres Freund



Reviving an old thread.

On Tue, Dec 8, 2020 at 4:35 PM Andres Freund <andres@anarazel.de> wrote:
> On 2020-12-07 08:58:18 +0000, Shinoda, Noriyoshi (PN Japan FSIP) wrote:
> > Thanks for the comment.
> > I newly cloned the latest version 13.1 source and then rebuilt it. The
> > segmentation fault has recurred.I tried the backtrace feature, but it
> > didn't work.
>
> What exactly do you mean by that? You attached to the worker with a
> debugger, and you didn't get a backtrace once it crashed? Same with a
> core file?  Or were you hoping for the segfault to automatically
> generate a backtrace?
>
> The easist way would be to enable core files with 'ulimit -c unlimited'
> in the shell you start postgres in, and to set the
> 'jit_debugging_support=1' option. Then, once the crash happend, you
> should be able to find a core file.
>
> You then can inspect that core file with
> gdb /path/to/postgres -core /path/to/core
>
> and execute 'bt' inside.
>
>
> > The function specification may not have been good.
>
> Which function's specification?
>
>
> > Segmentation faults do not always occur, but occasionally after running pg_cancel_backend several times.
> > Also, it only seems to happen if I canceled the parallel worker process.
>
> I tried this a couple hundred times without success.

I have happened to run into an odd JIT-related crash that may be related.

It happens (almost) every time when I run make installcheck with
force_parallel_mode=regress and build with LLVM 7.0.

The backtrace looks like this:

(gdb) bt
#0  0x00007f88fe18c8cd in std::_Function_handler<void (unsigned long,
llvm::object::ObjectFile const&),
llvm::OrcCBindingsStack::OrcCBindingsStack(llvm::TargetMachine&,
std::function<std::unique_ptr<llvm::orc::IndirectStubsManager,
std::default_delete<llvm::orc::IndirectStubsManager> >
()>)::{lambda(unsigned long, llvm::object::ObjectFile
const&)#3}>::_M_invoke(std::_Any_data const&, unsigned long,
llvm::object::ObjectFile const&) ()
   from /opt/rh/llvm-toolset-7.0/root/usr/lib64/libLLVM-7.so
#1  0x00007f88fe18e578 in
llvm::orc::RTDyldObjectLinkingLayer::ConcreteLinkedObject<std::shared_ptr<llvm::RuntimeDyld::MemoryManager>
>::~ConcreteLinkedObject() () from
/opt/rh/llvm-toolset-7.0/root/usr/lib64/libLLVM-7.so
#2  0x00007f88fe18e7aa in std::_Rb_tree<unsigned long,
std::pair<unsigned long const,
std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject,
std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject>
> >, std::_Select1st<std::pair<unsigned long const,
std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject,
std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject>
> > >, std::less<unsigned long>, std::allocator<std::pair<unsigned
long const, std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject,
std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject>
> > > >::_M_erase(std::_Rb_tree_node<std::pair<unsigned long const,
std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject,
std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject>
> > >*) ()
   from /opt/rh/llvm-toolset-7.0/root/usr/lib64/libLLVM-7.so
#3  0x00007f88fe19ac91 in llvm::OrcCBindingsStack::~OrcCBindingsStack() ()
   from /opt/rh/llvm-toolset-7.0/root/usr/lib64/libLLVM-7.so
#4  0x00007f88fe19afaa in LLVMOrcDisposeInstance () from
/opt/rh/llvm-toolset-7.0/root/usr/lib64/libLLVM-7.so
#5  0x00007f8900189c21 in llvm_shutdown (code=1, arg=0) at llvmjit.c:940
#6  0x0000000000ce541e in proc_exit_prepare (code=1) at ipc.c:209
#7  0x0000000000ce5223 in proc_exit (code=1) at ipc.c:107
#8  0x0000000000bd8414 in StartBackgroundWorker () at bgworker.c:821
#9  0x0000000000bec38c in do_start_bgworker (rw=0x3380b30) at postmaster.c:5809
#10 0x0000000000becab3 in maybe_start_bgworkers () at postmaster.c:6033
#11 0x0000000000bea9fa in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5190
#12 <signal handler called>
#13 0x00007f8908976b23 in __select_nocancel () from /lib64/libc.so.6
#14 0x0000000000be20de in ServerLoop () at postmaster.c:1772
#15 0x0000000000be1646 in PostmasterMain (argc=3, argv=0x33585b0) at
postmaster.c:1480
#16 0x00000000009e0794 in main (argc=3, argv=0x33585b0) at main.c:197

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com



On Thu, Jul 28, 2022 at 1:11 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Dec 8, 2020 at 4:35 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2020-12-07 08:58:18 +0000, Shinoda, Noriyoshi (PN Japan FSIP) wrote:
> > > Thanks for the comment.
> > > I newly cloned the latest version 13.1 source and then rebuilt it. The
> > > segmentation fault has recurred.I tried the backtrace feature, but it
> > > didn't work.
> >
> > What exactly do you mean by that? You attached to the worker with a
> > debugger, and you didn't get a backtrace once it crashed? Same with a
> > core file?  Or were you hoping for the segfault to automatically
> > generate a backtrace?
> >
> > The easist way would be to enable core files with 'ulimit -c unlimited'
> > in the shell you start postgres in, and to set the
> > 'jit_debugging_support=1' option. Then, once the crash happend, you
> > should be able to find a core file.
> >
> > You then can inspect that core file with
> > gdb /path/to/postgres -core /path/to/core
> >
> > and execute 'bt' inside.
> >
> >
> > > The function specification may not have been good.
> >
> > Which function's specification?
> >
> >
> > > Segmentation faults do not always occur, but occasionally after running pg_cancel_backend several times.
> > > Also, it only seems to happen if I canceled the parallel worker process.
> >
> > I tried this a couple hundred times without success.
>
> I have happened to run into an odd JIT-related crash that may be related.
>
> It happens (almost) every time when I run make installcheck with
> force_parallel_mode=regress and build with LLVM 7.0.

Forgot to mention: no crash when using LLVM 9.0 though.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com