Thread: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.
BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 16754 Logged by: Noriyoshi Shinoda Email address: noriyoshi.shinoda@hpe.com PostgreSQL version: 13.1 Operating system: Red Hat Enterprise Linux 7.8 (x86-64) Description: OS: Red Hat Enterprise Linux 7.8 (x86-64) Version: PostgreSQL 13.1 cfe-7.1.0 llvm-7.1.0 cmake-3.18.4 Configure: configure --prefix=$HOME/pg131 --with-llvm --enable-debug --enable-cassert $ pg_config BINDIR = /home/postgres/pg131/bin DOCDIR = /home/postgres/pg131/share/doc HTMLDIR = /home/postgres/pg131/share/doc INCLUDEDIR = /home/postgres/pg131/include PKGINCLUDEDIR = /home/postgres/pg131/include INCLUDEDIR-SERVER = /home/postgres/pg131/include/server LIBDIR = /home/postgres/pg131/lib PKGLIBDIR = /home/postgres/pg131/lib LOCALEDIR = /home/postgres/pg131/share/locale MANDIR = /home/postgres/pg131/share/man SHAREDIR = /home/postgres/pg131/share SYSCONFDIR = /home/postgres/pg131/etc PGXS = /home/postgres/pg131/lib/pgxs/src/makefiles/pgxs.mk CONFIGURE = '--prefix=/home/postgres/pg131' '--with-llvm' '--enable-debug' '--enable-cassert' CC = gcc -std=gnu99 CPPFLAGS = -D_GNU_SOURCE CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 CFLAGS_SL = -fPIC LDFLAGS = -L/usr/local/lib -Wl,--as-needed -Wl,-rpath,'/home/postgres/pg131/lib',--enable-new-dtags LDFLAGS_EX = LDFLAGS_SL = LIBS = -lpgcommon -lpgport -lpthread -lz -lreadline -lrt -ldl -lm VERSION = PostgreSQL 13.1 - Prepare $ initdb --no-locale --encoding=utf8 data.131 $ pg_ctl -D data.131 start $ psql postgres=# SELECT version(); version --------------------------------------------------------------------------------------------------------- PostgreSQL 13.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit (1 row) postgres=# CREATE USER demo PASSWORD 'demo'; CREATE ROLE postgres=# \connect postgres demo You are now connected to database "postgres" as user "demo". postgres=> CREATE TABLE data1(c1 NUMERIC, c2 VARCHAR(10)); CREATE TABLE postgres=> INSERT INTO data1 VALUES (generate_series(1, 1000000), 'data1'); INSERT 0 1000000 postgres=> ANALYZE data1; ANALYZE postgres=> \d+ List of relations Schema | Name | Type | Owner | Persistence | Size | Description --------+-------+-------+-------+-------------+-------+------------- public | data1 | table | demo | permanent | 42 MB | (1 row) - Operation -- Normal case (No JIT/Cancel parallel worker) --- Session#1 $ psql postgres demo postgres=> SET force_parallel_mode = on; SET postgres=> SET jit = off; SET postgres=> EXPLAIN SELECT * FROM data1 d1, data1 d2; QUERY PLAN ------------------------------------------------------------------------------------- Gather (cost=1000.00..117383034312.00 rows=1000000000000 width=24) Workers Planned: 1 Single Copy: true -> Nested Loop (cost=0.00..17383033312.00 rows=1000000000000 width=24) -> Seq Scan on data1 d1 (cost=0.00..15406.00 rows=1000000 width=12) -> Materialize (cost=0.00..25289.00 rows=1000000 width=12) -> Seq Scan on data1 d2 (cost=0.00..15406.00 rows=1000000 width=12) (7 rows) postgres=> EXPLAIN ANALYZE SELECT * FROM data1 d1, data1 d2; <- Execute pg_cancel_backend(130457) on Session#2 ERROR: canceling statement due to user request CONTEXT: parallel worker --- Session#2 $ psql postgres=# SELECT pid, leader_pid FROM pg_stat_activity WHERE usename='demo'; pid | leader_pid --------+------------ 130413 | 130457 | 130413 (2 rows) postgres=# SELECT pg_cancel_backend(130457); pg_cancel_backend ------------------- t (1 row) -- NG case (With JIT / Cancel parallel worker) --- Session#1 $ psql postgres demo postgres=> SET force_parallel_mode = on; SET postgres=> SET jit = on; SET postgres=> EXPLAIN SELECT * FROM data1 d1, data1 d2; QUERY PLAN ------------------------------------------------------------------------------------- Gather (cost=1000.00..117383034312.00 rows=1000000000000 width=24) Workers Planned: 1 Single Copy: true -> Nested Loop (cost=0.00..17383033312.00 rows=1000000000000 width=24) -> Seq Scan on data1 d1 (cost=0.00..15406.00 rows=1000000 width=12) -> Materialize (cost=0.00..25289.00 rows=1000000 width=12) -> Seq Scan on data1 d2 (cost=0.00..15406.00 rows=1000000 width=12) JIT: Functions: 3 Options: Inlining true, Optimization true, Expressions true, Deforming true (10 rows) postgres=> EXPLAIN ANALYZE SELECT * FROM data1 d1, data1 d2; <- Execute pg_cancel_backend(130342); WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. ERROR: canceling statement due to user request CONTEXT: parallel worker server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !?> --- Session#2 postgres=# SELECT pid, leader_pid FROM pg_stat_activity WHERE usename='demo'; pid | leader_pid --------+------------ 130322 | 130342 | 130322 (2 rows) postgres=# SELECT pg_cancel_backend(130342); pg_cancel_backend ------------------- t (1 row) - log file 2020-12-01 14:26:02.337 JST [130270] LOG: background worker "parallel worker" (PID 130342) was terminated by signal 11: Segmentation fault 2020-12-01 14:26:02.337 JST [130270] DETAIL: Failed process was running: EXPLAIN ANALYZE SELECT * FROM data1 d1, data1 d2; 2020-12-01 14:26:02.337 JST [130270] LOG: terminating any other active server processes 2020-12-01 14:26:02.337 JST [130322] WARNING: terminating connection because of crash of another server process 2020-12-01 14:26:02.337 JST [130322] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2020-12-01 14:26:02.337 JST [130322] HINT: In a moment you should be able to reconnect to the database and repeat your command. 2020-12-01 14:26:02.338 JST [130276] WARNING: terminating connection because of crash of another server process 2020-12-01 14:26:02.338 JST [130276] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2020-12-01 14:26:02.338 JST [130276] HINT: In a moment you should be able to reconnect to the database and repeat your command. 2020-12-01 14:26:02.340 JST [130345] FATAL: the database system is in recovery mode 2020-12-01 14:26:02.343 JST [130319] WARNING: terminating connection because of crash of another server process 2020-12-01 14:26:02.343 JST [130319] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2020-12-01 14:26:02.343 JST [130319] HINT: In a moment you should be able to reconnect to the database and repeat your command. 2020-12-01 14:26:02.343 JST [130270] LOG: all server processes terminated; reinitializing 2020-12-01 14:26:02.355 JST [130348] LOG: database system was interrupted; last known up at 2020-12-01 14:21:07 JST 2020-12-01 14:26:02.360 JST [130348] LOG: database system was not properly shut down; automatic recovery in progress 2020-12-01 14:26:02.360 JST [130348] LOG: redo starts at 0/15C07D0 2020-12-01 14:26:02.767 JST [130348] LOG: invalid record length at 0/5B10A90: wanted 24, got 0 2020-12-01 14:26:02.767 JST [130348] LOG: redo done at 0/5B10A58 2020-12-01 14:26:02.821 JST [130270] LOG: database system is ready to accept connections
Re: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.
From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes: > SELECT pg_cancel_backend(130342); > ... > 2020-12-01 14:26:02.337 JST [130270] LOG: background worker "parallel > worker" (PID 130342) was terminated by signal 11: Segmentation fault > 2020-12-01 14:26:02.337 JST [130270] DETAIL: Failed process was running: > EXPLAIN ANALYZE SELECT * FROM data1 d1, data1 d2; FWIW, I failed to reproduce any crash here, using RHEL8 (llvm-10.0.1). I don't see any plausibly-related commits since 13.1, so it's not something we recently fixed. I wonder if this is an LLVM bug fixed since the 7.1 release you have. regards, tom lane
RE: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.
From
"Shinoda, Noriyoshi (PN Japan FSIP)"
Date:
> I don't see any plausibly-related commits since 13.1, so it's not something we recently fixed. I wonder if this is anLLVM bug fixed since the 7.1 release you have. Thank you for your comment and test. I tried again with an older version of LLVM below. - PostgreSQL 13.1 - Red Hat Enterprise Linux 7.5 - LLVM 6.0 - cfe 6.0 I couldn't reproduce the problem int the above environment, so maybe it's a problem in my development environment. Regards, Noriyoshi Shinoda -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Wednesday, December 2, 2020 5:59 AM To: Shinoda, Noriyoshi (PN Japan FSIP) <noriyoshi.shinoda@hpe.com> Cc: pgsql-bugs@lists.postgresql.org; Andres Freund <andres@anarazel.de> Subject: Re: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend. PG Bug reporting form <noreply@postgresql.org> writes: > SELECT pg_cancel_backend(130342); > ... > 2020-12-01 14:26:02.337 JST [130270] LOG: background worker "parallel > worker" (PID 130342) was terminated by signal 11: Segmentation fault > 2020-12-01 14:26:02.337 JST [130270] DETAIL: Failed process was running: > EXPLAIN ANALYZE SELECT * FROM data1 d1, data1 d2; FWIW, I failed to reproduce any crash here, using RHEL8 (llvm-10.0.1). I don't see any plausibly-related commits since 13.1, so it's not something we recently fixed. I wonder if this is an LLVMbug fixed since the 7.1 release you have. regards, tom lane
Re: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.
From
Andres Freund
Date:
Hi, On 2020-12-02 01:11:07 +0000, Shinoda, Noriyoshi (PN Japan FSIP) wrote: > > I don't see any plausibly-related commits since 13.1, so it's not something we recently fixed. I wonder if this is anLLVM bug fixed since the 7.1 release you have. > > Thank you for your comment and test. > I tried again with an older version of LLVM below. > > - PostgreSQL 13.1 > - Red Hat Enterprise Linux 7.5 > - LLVM 6.0 > - cfe 6.0 Could you try to reproduce the issue with the version you hit it earlier? It'd be helpful to get a backtrace. > I couldn't reproduce the problem int the above environment, so maybe it's a problem in my development environment. Did you have any modifications in the tree? Did you do a build, pulled some updates, and built again? I noticed that you didn't pass --enable-depend to ./configure, so it's possible that not all files got rebuilt that needed to. Regards, Andres
RE: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.
From
"Shinoda, Noriyoshi (PN Japan FSIP)"
Date:
Hi, Thanks for the comment. I newly cloned the latest version 13.1 source and then rebuilt it. The segmentation fault has recurred.I tried the backtracefeature, but it didn't work. The function specification may not have been good. Segmentation faults do not always occur, but occasionally after running pg_cancel_backend several times. Also, it only seems to happen if I canceled the parallel worker process. Regards, Noriyoshi Shinoda -----Original Message----- From: Andres Freund [mailto:andres@anarazel.de] Sent: Friday, December 4, 2020 5:51 AM To: Shinoda, Noriyoshi (PN Japan FSIP) <noriyoshi.shinoda@hpe.com> Cc: Tom Lane <tgl@sss.pgh.pa.us>; pgsql-bugs@lists.postgresql.org Subject: Re: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend. Hi, On 2020-12-02 01:11:07 +0000, Shinoda, Noriyoshi (PN Japan FSIP) wrote: > > I don't see any plausibly-related commits since 13.1, so it's not something we recently fixed. I wonder if this is anLLVM bug fixed since the 7.1 release you have. > > Thank you for your comment and test. > I tried again with an older version of LLVM below. > > - PostgreSQL 13.1 > - Red Hat Enterprise Linux 7.5 > - LLVM 6.0 > - cfe 6.0 Could you try to reproduce the issue with the version you hit it earlier? It'd be helpful to get a backtrace. > I couldn't reproduce the problem int the above environment, so maybe it's a problem in my development environment. Did you have any modifications in the tree? Did you do a build, pulled some updates, and built again? I noticed that you didn't pass --enable-depend to ./configure,so it's possible that not all files got rebuilt that needed to. Regards, Andres
Attachment
Re: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.
From
Andres Freund
Date:
Hi, On 2020-12-07 08:58:18 +0000, Shinoda, Noriyoshi (PN Japan FSIP) wrote: > Thanks for the comment. > I newly cloned the latest version 13.1 source and then rebuilt it. The > segmentation fault has recurred.I tried the backtrace feature, but it > didn't work. What exactly do you mean by that? You attached to the worker with a debugger, and you didn't get a backtrace once it crashed? Same with a core file? Or were you hoping for the segfault to automatically generate a backtrace? The easist way would be to enable core files with 'ulimit -c unlimited' in the shell you start postgres in, and to set the 'jit_debugging_support=1' option. Then, once the crash happend, you should be able to find a core file. You then can inspect that core file with gdb /path/to/postgres -core /path/to/core and execute 'bt' inside. > The function specification may not have been good. Which function's specification? > Segmentation faults do not always occur, but occasionally after running pg_cancel_backend several times. > Also, it only seems to happen if I canceled the parallel worker process. I tried this a couple hundred times without success. Greetings, Andres Freund
Re: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.
From
Amit Langote
Date:
Reviving an old thread. On Tue, Dec 8, 2020 at 4:35 PM Andres Freund <andres@anarazel.de> wrote: > On 2020-12-07 08:58:18 +0000, Shinoda, Noriyoshi (PN Japan FSIP) wrote: > > Thanks for the comment. > > I newly cloned the latest version 13.1 source and then rebuilt it. The > > segmentation fault has recurred.I tried the backtrace feature, but it > > didn't work. > > What exactly do you mean by that? You attached to the worker with a > debugger, and you didn't get a backtrace once it crashed? Same with a > core file? Or were you hoping for the segfault to automatically > generate a backtrace? > > The easist way would be to enable core files with 'ulimit -c unlimited' > in the shell you start postgres in, and to set the > 'jit_debugging_support=1' option. Then, once the crash happend, you > should be able to find a core file. > > You then can inspect that core file with > gdb /path/to/postgres -core /path/to/core > > and execute 'bt' inside. > > > > The function specification may not have been good. > > Which function's specification? > > > > Segmentation faults do not always occur, but occasionally after running pg_cancel_backend several times. > > Also, it only seems to happen if I canceled the parallel worker process. > > I tried this a couple hundred times without success. I have happened to run into an odd JIT-related crash that may be related. It happens (almost) every time when I run make installcheck with force_parallel_mode=regress and build with LLVM 7.0. The backtrace looks like this: (gdb) bt #0 0x00007f88fe18c8cd in std::_Function_handler<void (unsigned long, llvm::object::ObjectFile const&), llvm::OrcCBindingsStack::OrcCBindingsStack(llvm::TargetMachine&, std::function<std::unique_ptr<llvm::orc::IndirectStubsManager, std::default_delete<llvm::orc::IndirectStubsManager> > ()>)::{lambda(unsigned long, llvm::object::ObjectFile const&)#3}>::_M_invoke(std::_Any_data const&, unsigned long, llvm::object::ObjectFile const&) () from /opt/rh/llvm-toolset-7.0/root/usr/lib64/libLLVM-7.so #1 0x00007f88fe18e578 in llvm::orc::RTDyldObjectLinkingLayer::ConcreteLinkedObject<std::shared_ptr<llvm::RuntimeDyld::MemoryManager> >::~ConcreteLinkedObject() () from /opt/rh/llvm-toolset-7.0/root/usr/lib64/libLLVM-7.so #2 0x00007f88fe18e7aa in std::_Rb_tree<unsigned long, std::pair<unsigned long const, std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, std::_Select1st<std::pair<unsigned long const, std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > > >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > > > >::_M_erase(std::_Rb_tree_node<std::pair<unsigned long const, std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > > >*) () from /opt/rh/llvm-toolset-7.0/root/usr/lib64/libLLVM-7.so #3 0x00007f88fe19ac91 in llvm::OrcCBindingsStack::~OrcCBindingsStack() () from /opt/rh/llvm-toolset-7.0/root/usr/lib64/libLLVM-7.so #4 0x00007f88fe19afaa in LLVMOrcDisposeInstance () from /opt/rh/llvm-toolset-7.0/root/usr/lib64/libLLVM-7.so #5 0x00007f8900189c21 in llvm_shutdown (code=1, arg=0) at llvmjit.c:940 #6 0x0000000000ce541e in proc_exit_prepare (code=1) at ipc.c:209 #7 0x0000000000ce5223 in proc_exit (code=1) at ipc.c:107 #8 0x0000000000bd8414 in StartBackgroundWorker () at bgworker.c:821 #9 0x0000000000bec38c in do_start_bgworker (rw=0x3380b30) at postmaster.c:5809 #10 0x0000000000becab3 in maybe_start_bgworkers () at postmaster.c:6033 #11 0x0000000000bea9fa in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5190 #12 <signal handler called> #13 0x00007f8908976b23 in __select_nocancel () from /lib64/libc.so.6 #14 0x0000000000be20de in ServerLoop () at postmaster.c:1772 #15 0x0000000000be1646 in PostmasterMain (argc=3, argv=0x33585b0) at postmaster.c:1480 #16 0x00000000009e0794 in main (argc=3, argv=0x33585b0) at main.c:197 -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Re: BUG #16754: When using LLVM and parallel queries aborted all session by pg_cancel_backend.
From
Amit Langote
Date:
On Thu, Jul 28, 2022 at 1:11 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Dec 8, 2020 at 4:35 PM Andres Freund <andres@anarazel.de> wrote: > > On 2020-12-07 08:58:18 +0000, Shinoda, Noriyoshi (PN Japan FSIP) wrote: > > > Thanks for the comment. > > > I newly cloned the latest version 13.1 source and then rebuilt it. The > > > segmentation fault has recurred.I tried the backtrace feature, but it > > > didn't work. > > > > What exactly do you mean by that? You attached to the worker with a > > debugger, and you didn't get a backtrace once it crashed? Same with a > > core file? Or were you hoping for the segfault to automatically > > generate a backtrace? > > > > The easist way would be to enable core files with 'ulimit -c unlimited' > > in the shell you start postgres in, and to set the > > 'jit_debugging_support=1' option. Then, once the crash happend, you > > should be able to find a core file. > > > > You then can inspect that core file with > > gdb /path/to/postgres -core /path/to/core > > > > and execute 'bt' inside. > > > > > > > The function specification may not have been good. > > > > Which function's specification? > > > > > > > Segmentation faults do not always occur, but occasionally after running pg_cancel_backend several times. > > > Also, it only seems to happen if I canceled the parallel worker process. > > > > I tried this a couple hundred times without success. > > I have happened to run into an odd JIT-related crash that may be related. > > It happens (almost) every time when I run make installcheck with > force_parallel_mode=regress and build with LLVM 7.0. Forgot to mention: no crash when using LLVM 9.0 though. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com