Thread: [BUGS] signal 11 segfaults with parallel workers

[BUGS] signal 11 segfaults with parallel workers

From: Rick Otten
Starting a couple of weeks ago, our PostgreSQL database has been crashing, almost daily, with a signal 11 seg fault on a query as the triggering event:

2017-07-11 23:00:29.984 UTC     LOG:  worker process: parallel worker for PID 1055 (PID 12405) was terminated by signal 11: Segmentation fault
2017-07-12 23:01:56.432 UTC     LOG:  worker process: parallel worker for PID 5752 (PID 32552) was terminated by signal 11: Segmentation fault
2017-07-14 23:00:46.856 UTC     LOG:  worker process: parallel worker for PID 24280 (PID 9639) was terminated by signal 11: Segmentation fault
2017-07-15 23:01:24.317 UTC     LOG:  worker process: parallel worker for PID 1561 (PID 15153) was terminated by signal 11: Segmentation fault
2017-07-16 23:00:26.722 UTC     LOG:  worker process: parallel worker for PID 5776 (PID 7912) was terminated by signal 11: Segmentation fault
2017-07-17 18:58:14.155 UTC     LOG:  worker process: parallel worker for PID 11427 (PID 9998) was terminated by signal 11: Segmentation fault
2017-07-17 19:08:04.103 UTC     LOG:  worker process: parallel worker for PID 10190 (PID 11907) was terminated by signal 11: Segmentation fault
2017-07-18 23:01:09.775 UTC     LOG:  worker process: parallel worker for PID 29445 (PID 360) was terminated by signal 11: Segmentation fault
2017-07-19 18:46:58.676 UTC     LOG:  worker process: parallel worker for PID 7080 (PID 27710) was terminated by signal 11: Segmentation fault
2017-07-20 23:00:35.270 UTC     LOG:  worker process: parallel worker for PID 19153 (PID 21218) was terminated by signal 11: Segmentation fault
2017-07-21 23:00:41.085 UTC     LOG:  worker process: parallel worker for PID 19161 (PID 30720) was terminated by signal 11: Segmentation fault
2017-07-22 23:00:22.169 UTC     LOG:  worker process: parallel worker for PID 4903 (PID 6931) was terminated by signal 11: Segmentation fault
2017-07-25 23:02:03.688 UTC     LOG:  worker process: parallel worker for PID 11099 (PID 11280) was terminated by signal 11: Segmentation fault

As near as I can tell, no specific change preceded this pattern that might be a root cause.  Since then I've patched the Linux instance, bounced the database server, and bumped up the number of connections (because we were sometimes running low).  None of those changes affected the regular crashing pattern.

On Sunday (2017-07-23) I set DEBUG5 on all log events, and set it to also log all queries, so I could try to learn more about what was happening.  I found the culprit query last night, from one of our daily jobs.  It was doing a parallel sequential scan with 5 workers on a moderately sized table (about 70 columns by 2.5M rows).
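
The timestamps in the log excerpt above already hint at a scheduled job: most of the crashes land within a couple of minutes of 23:00 UTC. A minimal sketch of tallying crashes by hour, assuming the log line format shown above (the function name, template, and regular expression are illustrative and not part of any PostgreSQL tooling):

```python
import re
from collections import Counter

# Matches the "terminated by signal 11" lines in the format quoted above.
CRASH_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} (?P<hour>\d{2}):\d{2}:\d{2}\.\d+ UTC\s+LOG:\s+"
    r"worker process: parallel worker for PID (?P<leader>\d+) "
    r"\(PID (?P<worker>\d+)\) was terminated by signal 11"
)

def crashes_by_hour(lines):
    """Count parallel-worker segfault log lines per hour of day (UTC)."""
    hours = Counter()
    for line in lines:
        m = CRASH_RE.match(line)
        if m:
            hours[int(m["hour"])] += 1
    return hours

# Four of the lines from the report, rebuilt from a template to keep this short.
TEMPLATE = ("{ts} UTC     LOG:  worker process: parallel worker for PID {leader} "
            "(PID {worker}) was terminated by signal 11: Segmentation fault")
sample = [
    TEMPLATE.format(ts="2017-07-16 23:00:26.722", leader=5776, worker=7912),
    TEMPLATE.format(ts="2017-07-17 18:58:14.155", leader=11427, worker=9998),
    TEMPLATE.format(ts="2017-07-17 19:08:04.103", leader=10190, worker=11907),
    TEMPLATE.format(ts="2017-07-18 23:01:09.775", leader=29445, worker=360),
]

print(sorted(crashes_by_hour(sample).items()))  # [(18, 1), (19, 1), (23, 2)]
```

Run against the full server log, a tally like this separates the nightly 23:00 crashes from the daytime ones on the 17th and 19th.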

I was not able to force the database to crash by running this query by hand, although I tried a number of times.  It did happen to someone else on the 17th, though.

For now, I've put an index on the relevant columns to avoid the parallel sequence scan for that query.  I also repacked the table.  Hopefully we won't crash tonight too.

There wasn't much extra in the logs to share about the crash.  This is from when it crashed:

2017-07-25 23:02:03.688 UTC     DEBUG:  reaping dead processes
2017-07-25 23:02:03.688 UTC     LOG:  worker process: parallel worker for PID 11099 (PID 11280) was terminated by signal 11: Segmentation fault

And this is where it spun up the parallel workers just prior to the segfault, which let me identify the query in question:

2017-07-25 23:02:01.804 UTC     DEBUG:  registering background worker "parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC     DEBUG:  registering background worker "parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC     DEBUG:  registering background worker "parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC     DEBUG:  registering background worker "parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC     DEBUG:  registering background worker "parallel worker for PID 11099"
2017-07-25 23:02:01.804 UTC     DEBUG:  starting background worker process "parallel worker for PID 11099"
2017-07-25 23:02:01.805 UTC     DEBUG:  starting background worker process "parallel worker for PID 11099"
2017-07-25 23:02:01.805 UTC     DEBUG:  starting background worker process "parallel worker for PID 11099"
2017-07-25 23:02:01.806 UTC     DEBUG:  starting background worker process "parallel worker for PID 11099"
2017-07-25 23:02:01.806 UTC     DEBUG:  starting background worker process "parallel worker for PID 11099"


Here is what /var/log/kern.log had to say about the one from last night:

Jul 25 23:02:01 core-gce kernel: [738031.417934] postgres[11279]: segfault at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in postgres[55dc249cd000+64c000]
Jul 25 23:02:01 core-gce kernel: [738031.417953] postgres[11278]: segfault at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in postgres[55dc249cd000+64c000]
Jul 25 23:02:01 core-gce kernel: [738031.417967] postgres[11280]: segfault at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in postgres[55dc249cd000+64c000]
Jul 25 23:02:01 core-gce kernel: [738031.417989] postgres[11276]: segfault at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 in postgres[55dc249cd000+64c000]
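
The kernel lines above carry enough information to locate the faulting instruction inside the binary: subtracting the mapping base from the instruction pointer gives an offset that a symbolizer such as addr2line can resolve once debug symbols are available. A minimal sketch of that arithmetic, assuming the kern.log format shown above and assuming the reported mapping starts at the image's load base (the function name and regular expression are hypothetical):

```python
import re

# Matches the x86 Linux "segfault at ..." line format seen in kern.log above.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def parse_segfault(line):
    """Return the pid, faulting address and code offset, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),  # the address the process tried to access
        "offset": ip - base,               # instruction offset within the mapping
        "object": m["obj"],
    }

line = ("Jul 25 23:02:01 core-gce kernel: [738031.417967] postgres[11280]: "
        "segfault at 8 ip 000055dc24e22403 sp 00007ffc84dbf6f0 error 4 "
        "in postgres[55dc249cd000+64c000]")
info = parse_segfault(line)
print(info["pid"], hex(info["offset"]), info["fault_addr"])  # 11280 0x455403 8
```

All four workers report the same instruction pointer and a faulting address of 8, which is what a read through a null pointer at a small field offset would look like.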


I'm running Ubuntu 16.04.2 on Google Compute Engine, on a 16-core VM with 104 GB of RAM, using the Ubuntu PostgreSQL 9.6.3 package.

$  pg_config --configure
'--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt' '--with-tclconfig=/usr/lib/x86_64-linux-gnu/tcl8.6' '--with-includes=/usr/include/tcl8.6' 'PYTHON=/usr/bin/python' '--mandir=/usr/share/postgresql/9.6/man' '--docdir=/usr/share/doc/postgresql-doc-9.6' '--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/' '--datadir=/usr/share/postgresql/9.6' '--bindir=/usr/lib/postgresql/9.6/bin' '--libdir=/usr/lib/x86_64-linux-gnu/' '--libexecdir=/usr/lib/postgresql/' '--includedir=/usr/include/postgresql/' '--enable-nls' '--enable-integer-datetimes' '--enable-thread-safety' '--enable-tap-tests' '--enable-debug' '--disable-rpath' '--with-uuid=e2fs' '--with-gnu-ld' '--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo' '--with-systemd' 'CFLAGS=-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -I/usr/include/mit-krb5 -fPIC -pie -fno-omit-frame-pointer' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5' '--with-krb5' '--with-gssapi' '--with-ldap' '--with-selinux' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2'

$  pg_config --ldflags
-L../../src/common -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5 -Wl,--as-needed

$  pg_config --cflags
-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -I/usr/include/mit-krb5 -fPIC -pie -fno-omit-frame-pointer

I have these two most relevant settings enabled in my configuration:
max_worker_processes = 16
max_parallel_workers_per_gather = 16

If you need anything else, please let me know.  I wish I could reproduce the error every time I ran the query, but it doesn't seem to work that way, and of course the query plan is now completely different.  I'm sure I can run other queries that would induce parallel sequential scans on my tables, though.


Re: [BUGS] signal 11 segfaults with parallel workers

From: Michael Paquier
On Wed, Jul 26, 2017 at 4:47 PM, Rick Otten <rottenwindfish@gmail.com> wrote:
> If you need anything else, please let me know.  I wish I could reproduce the
> error every time I ran the query, but it doesn't seem to work that way, and
> of course now the query plan is completely different, but I'm sure I can run
> other queries that would induce parallel sequence scans on my tables.

Backtrace of the core files generated with debug symbols on, and a
minimum test case to reproduce the failure usually help.
-- 
Michael


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] signal 11 segfaults with parallel workers

From: Rick Otten
I don't have any core files.  I suppose that is something I have to enable specifically?   I'm game to turn it on in case we core dump again.

If I could get it to fail every time I ran the query, I'm sure I could build a test case for you.  Sorry. :-(   

On Wed, Jul 26, 2017 at 10:58 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Jul 26, 2017 at 4:47 PM, Rick Otten <rottenwindfish@gmail.com> wrote:
> > If you need anything else, please let me know.  I wish I could reproduce the
> > error every time I ran the query, but it doesn't seem to work that way, and
> > of course now the query plan is completely different, but I'm sure I can run
> > other queries that would induce parallel sequence scans on my tables.
>
> Backtrace of the core files generated with debug symbols on, and a
> minimum test case to reproduce the failure usually help.
> --
> Michael

Re: [BUGS] signal 11 segfaults with parallel workers

From: Tom Lane
Rick Otten <rottenwindfish@gmail.com> writes:
> I don't have any core files.  I suppose that is something I have to enable
> specifically?   I'm game to turn it on in case we core dump again.

If you're not seeing core files, you probably need to take measures
to make the postmaster run with "ulimit -c unlimited".  It's fairly
common for daemon processes to get launched under "ulimit -c 0"
by default, for largely-misguided-imo security reasons.
        regards, tom lane
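
Before waiting for the next crash, it can be worth confirming that the running postmaster really has a non-zero core limit. On Linux, the limits of a live process are exposed in /proc/<pid>/limits. A minimal sketch that reads the "Max core file size" row; the helper names are hypothetical, and the sample text only imitates the layout of that file:

```python
def core_limits(limits_text):
    """Return (soft, hard) core-size limits from /proc/<pid>/limits text."""
    label = "Max core file size"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    return None

def read_core_limits(pid):
    """Read the limits of a live process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return core_limits(f.read())

# Sample imitating a daemon that was started under "ulimit -c 0".
sample = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max cpu time              unlimited            unlimited            seconds\n"
    "Max core file size        0                    unlimited            bytes\n"
)
print(core_limits(sample))  # ('0', 'unlimited')
```

A soft limit of 0 means no core file will be written, whatever the hard limit allows; the change only takes effect for a postmaster started after the limit is raised.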


Re: [BUGS] signal 11 segfaults with parallel workers

From: Rick Otten
I'll restart the database tonight to pick up the ulimit change and let you know if I capture a core file in the near future.


On Wed, Jul 26, 2017 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Rick Otten <rottenwindfish@gmail.com> writes:
> > I don't have any core files.  I suppose that is something I have to enable
> > specifically?   I'm game to turn it on in case we core dump again.
>
> If you're not seeing core files, you probably need to take measures
> to make the postmaster run with "ulimit -c unlimited".  It's fairly
> common for daemon processes to get launched under "ulimit -c 0"
> by default, for largely-misguided-imo security reasons.
>
>                         regards, tom lane

Re: [BUGS] signal 11 segfaults with parallel workers

From: David Gould
On Wed, 26 Jul 2017 11:43:22 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Rick Otten <rottenwindfish@gmail.com> writes:
> > I don't have any core files.  I suppose that is something I have to enable
> > specifically?   I'm game to turn it on in case we core dump again.  
> 
> If you're not seeing core files, you probably need to take measures
> to make the postmaster run with "ulimit -c unlimited".  It's fairly
> common for daemon processes to get launched under "ulimit -c 0"
> by default, for largely-misguided-imo security reasons.

If you are using pg_ctl to start postgresql you can add the "-c" flag to your
pg_ctl command to enable core files.

-dg

-- 
David Gould              510 282 0869         daveg@sonic.net
If simplicity worked, the world would be overrun with insects.


Re: [BUGS] signal 11 segfaults with parallel workers

From: Rick Otten
FWIW, the database crashed again tonight.  It hasn't been quiet enough yet to be able to restart it in a controlled fashion to enable cores.  Hopefully I'll get a chance this weekend! 

2017-07-27 23:01:20.411 UTC     LOG:  worker process: parallel worker for PID 31472 (PID 2186) was terminated by signal 11: Segmentation fault

Since I didn't have statement logging and debug turned on this time, I can only guess which query seg faulted.

Is enabling DEBUG in the postgresql.conf sufficient to enable debug symbols in the core, or do I have to rebuild the postgresql binaries to get that?  Is the core of any use without debug symbols enabled?


On Wed, Jul 26, 2017 at 12:26 PM, Rick Otten <rottenwindfish@gmail.com> wrote:
> I'll restart the database tonight to pick up the ulimit change and let you know if I capture a core file in the near future.
>
> On Wed, Jul 26, 2017 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Rick Otten <rottenwindfish@gmail.com> writes:
> > > I don't have any core files.  I suppose that is something I have to enable
> > > specifically?   I'm game to turn it on in case we core dump again.
> >
> > If you're not seeing core files, you probably need to take measures
> > to make the postmaster run with "ulimit -c unlimited".  It's fairly
> > common for daemon processes to get launched under "ulimit -c 0"
> > by default, for largely-misguided-imo security reasons.
> >
> >                         regards, tom lane


Re: [BUGS] signal 11 segfaults with parallel workers

From: Tom Lane
Rick Otten <rottenwindfish@gmail.com> writes:
> Is enabling DEBUG in the postgresql.conf sufficient to enable debug symbols
> in the core, or do I have to rebuild the postgresql binaries to get that?

You would need to recompile (with --enable-debug added to configure
switches) if they're not there already.  But if you used somebody's
packaging rather than a homebrew build, you can probably get the
symbols installed without doing your own build.

> Is the core of any use without debug symbols enabled?

You should still be able to get a stack trace out of it, but the trace
would be much more informative with debug symbols.  See
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
        regards, tom lane


Re: [BUGS] signal 11 segfaults with parallel workers

From: Rick Otten
I'm using the Ubuntu PostgreSQL 9.6.3 package from this repo:
deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main

It looks like there is a "-dbg" package available:
   postgresql-9.6-dbg - debug symbols for postgresql-9.6

I'll give that a try when I get the restart opportunity.



On Thu, Jul 27, 2017 at 8:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Rick Otten <rottenwindfish@gmail.com> writes:
> > Is enabling DEBUG in the postgresql.conf sufficient to enable debug symbols
> > in the core, or do I have to rebuild the postgresql binaries to get that?
>
> You would need to recompile (with --enable-debug added to configure
> switches) if they're not there already.  But if you used somebody's
> packaging rather than a homebrew build, you can probably get the
> symbols installed without doing your own build.
>
> > Is the core of any use without debug symbols enabled?
>
> You should still be able to get a stack trace out of it, but the trace
> would be much more informative with debug symbols.  See
> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
>
>                         regards, tom lane

Re: [BUGS] signal 11 segfaults with parallel workers

From: Alvaro Herrera
Rick Otten wrote:
> I'm using the Ubuntu PostgresSQL 9.6.3 from this repo:
> deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main
> 
> It looks like there is a "-dbg" package available:
>    postgresql-9.6-dbg - debug symbols for postgresql-9.6
> 
> I'll give that a try when I get the restart opportunity.

You can install the -dbg package without waiting for a restart; it won't
disrupt anything.  Also, if you already got a core from the last crash,
installing that package now would be enough to be able to extract info
from the core file, assuming the -dbg package is the same version as the
package version that was running when it crashed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [BUGS] signal 11 segfaults with parallel workers

From: Rick Otten
Thanks!  I've got the -dbg package installed and I've restarted the server and the database this morning.  We've continued to crash almost every night, so if the restart doesn't do something strange, I should have a core file within a day or two.

One thing that is bugging me is I think when the database crashes, it doesn't clean up the temp_tablespace(s).  I've noticed as I'm working through this issue that the temp tablespace keeps creeping up in size and there doesn't seem to be any obvious way to recover that space.  I've been keeping up with it for now by making the disk bigger, but obviously I can't do that indefinitely.

I was debating making a new temp tablespace and then dropping the old one, but there must be an easier, safe way to clear dangling temp tablespace stuff?  Google wasn't terribly helpful to uncover strategies for dealing with temp tablespace bloat.


On Fri, Jul 28, 2017 at 3:02 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Rick Otten wrote:
> > I'm using the Ubuntu PostgresSQL 9.6.3 from this repo:
> > deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main
> >
> > It looks like there is a "-dbg" package available:
> >    postgresql-9.6-dbg - debug symbols for postgresql-9.6
> >
> > I'll give that a try when I get the restart opportunity.
>
> You can install the -dbg package without waiting for a restart; it won't
> disrupt anything.  Also, if you already got a core from the last crash,
> installing that package now would be enough to be able to extract info
> from the core file, assuming the -dbg package is the same version as the
> package version that was running when it crashed.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [BUGS] signal 11 segfaults with parallel workers

From: Tom Lane
Rick Otten <rottenwindfish@gmail.com> writes:
> One thing that is bugging me is I think when the database crashes, it
> doesn't clean up the temp_tablespace(s).

Hm, interesting, what do you see in there?
        regards, tom lane


Re: [BUGS] signal 11 segfaults with parallel workers

From: Rick Otten
Well, I'm not sure how to inspect the temp tablespace other than from the filesystem itself.  I have it configured on its own disk.  Usually the disk space ebbs and flows with query activity.  Since we've been crashing, however, it never reclaims the disk that was in use just before the crash.  So our temp space "floor" keeps getting higher and higher.

At least that is what it has been doing for the past week or two, and what it looked like this morning.  Now that the database has been back up for 8 or 9 hours following this controlled restart, I just went to look at it, and all of the temp space has been reclaimed - for the first time since the crashing started. ... Interesting...


On Sun, Jul 30, 2017 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Rick Otten <rottenwindfish@gmail.com> writes:
> > One thing that is bugging me is I think when the database crashes, it
> > doesn't clean up the temp_tablespace(s).
>
> Hm, interesting, what do you see in there?
>
>                         regards, tom lane

Re: [BUGS] signal 11 segfaults with parallel workers

From: Rick Otten
Ok, I got a core this time at 23:00 when the database went down.
Here is the basic backtrace:

$  gdb /usr/lib/postgresql/9.6/bin/postgres core
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
Find the GDB manual and other documentation resources online at:
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/postgresql/9.6/bin/postgres...Reading symbols from /usr/lib/debug/.build-id/32/108810b4ff9528a94d48315dd9333c501fc52d.debug...done.
done.
[New LWP 4294]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: bgworker: parallel worker f'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  MemoryContextAlloc (context=0x0, size=size@entry=1024) at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/mmgr/mcxt.c:761
761 /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/mmgr/mcxt.c: No such file or directory.
(gdb) bt
#0  MemoryContextAlloc (context=0x0, size=size@entry=1024) at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/mmgr/mcxt.c:761
#1  0x0000560b7a518ec4 in SPI_connect () at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/executor/spi.c:102
#2  0x00007fec467b9261 in _PG_init () from /usr/lib/postgresql/9.6/lib/multicorn.so
#3  0x0000560b7a717cf2 in internal_load_library (libname=libname@entry=0x7ff48208dbf8 <error: Cannot access memory at address 0x7ff48208dbf8>)
    at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/fmgr/dfmgr.c:276
#4  0x0000560b7a7188c0 in RestoreLibraryState (start_address=0x7ff48208dbf8 <error: Cannot access memory at address 0x7ff48208dbf8>)
    at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/fmgr/dfmgr.c:741
#5  0x0000560b7a3ee4f7 in ParallelWorkerMain (main_arg=<optimized out>)
    at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/access/transam/parallel.c:1065
#6  0x0000560b7a59ae29 in StartBackgroundWorker () at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/postmaster/bgworker.c:742
#7  0x0000560b7a5a701b in do_start_bgworker (rw=<optimized out>)
    at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/postmaster/postmaster.c:5579
#8  maybe_start_bgworkers () at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/postmaster/postmaster.c:5776
#9  0x0000560b7a5a7cd5 in sigusr1_handler (postgres_signal_arg=<optimized out>)
    at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/postmaster/postmaster.c:4973
#10 <signal handler called>
#11 0x00007ff480425573 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#12 0x0000560b7a3858ef in ServerLoop () at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/postmaster/postmaster.c:1679
#13 0x0000560b7a5a9053 in PostmasterMain (argc=1, argv=<optimized out>)
    at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/postmaster/postmaster.c:1323
#14 0x0000560b7a387511 in main (argc=1, argv=0x560b7ba23630) at /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/main/main.c:228
(gdb) 

The query that took it down this time (based on the pid reported in the stacktrace) does indeed spin out a parallel plan, but it is a simple query.  I was surprised to see the multicorn library mentioned in this trace; the query has nothing to do with the multicorn FDW installed on the system.

I've run the query several times in the last few minutes and can't get it to generate a core again.



On Sun, Jul 30, 2017 at 5:25 PM, Rick Otten <rottenwindfish@gmail.com> wrote:
> Well, I'm not sure how to inspect the temp tablespace other than from the filesystem itself.  I have it configured on its own disk.  Usually the disk space ebbs and flows with query activity.  Since we've been crashing however, it never reclaims the disk that was in use just before the crash.  So our temp space "floor" keeps getting higher and higher.
>
> At least that is what it has been doing for the past week or two, and what it looked like this morning.  Now that the database has been back up for 8 or 9 hours following this controlled restart, I just went to look at it, and all of the temp space has been reclaimed - for the first time since the crashing started. ... Interesting...
>
> On Sun, Jul 30, 2017 at 11:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Rick Otten <rottenwindfish@gmail.com> writes:
> > > One thing that is bugging me is I think when the database crashes, it
> > > doesn't clean up the temp_tablespace(s).
> >
> > Hm, interesting, what do you see in there?
> >
> >                         regards, tom lane


Re: [BUGS] signal 11 segfaults with parallel workers

From: Amit Kapila
On Mon, Jul 31, 2017 at 6:35 AM, Rick Otten <rottenwindfish@gmail.com> wrote:
> Ok, I got a core this time at 23:00 when the database went down.
> Here is the basic backtrace:
>
>
> The query that took it down this time (based on the pid reported in the
> stacktrace) does indeed spin out a parallel plan, but it is a simple query.
> I was surprised to see the multicorn library mentioned in this trace, it has
> nothing to do with the multicorn FDW installed on the system.
>

Parallel workers load all the libraries that the main backend has
loaded.  This ensures that the workers have exactly the same GUCs
defined as the master backend.

> I've run the query several times in the last few minutes and can't get it to
> generate a core again.
>

Did the query take the parallel plan during execution?  The above
symptom shows that it should crash if you run the same query after
restarting the server.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [BUGS] signal 11 segfaults with parallel workers

From: Andres Freund
Hi,

On 2017-07-30 21:05:50 -0400, Rick Otten wrote:
> Ok, I got a core this time at 23:00 when the database went down.
> Here is the basic backtrace:

> (gdb) bt
> #0  MemoryContextAlloc (context=0x0, size=size@entry=1024) at
> /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/mmgr/mcxt.c:761
> #1  0x0000560b7a518ec4 in SPI_connect () at
> /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/executor/spi.c:102
> #2  0x00007fec467b9261 in _PG_init () from
> /usr/lib/postgresql/9.6/lib/multicorn.so
> #3  0x0000560b7a717cf2 in internal_load_library
> (libname=libname@entry=0x7ff48208dbf8
> <error: Cannot access memory at address 0x7ff48208dbf8>)
>     at
> /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/fmgr/dfmgr.c:276
> #4  0x0000560b7a7188c0 in RestoreLibraryState (start_address=0x7ff48208dbf8
> <error: Cannot access memory at address 0x7ff48208dbf8>)
>     at

Rick: Looks like a buglet in multicorn, which seems to expect to be
called in a valid memory context. Can you reproduce the bug if you use
multicorn, and then in the same session execute the problematic query?

Robert, was it intentional that we don't have a memory context defined
at this point?

Regards,

Andres


Re: [BUGS] signal 11 segfaults with parallel workers

From: Amit Kapila
On Mon, Jul 31, 2017 at 8:26 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-07-30 21:05:50 -0400, Rick Otten wrote:
>> Ok, I got a core this time at 23:00 when the database went down.
>> Here is the basic backtrace:
>
>> (gdb) bt
>> #0  MemoryContextAlloc (context=0x0, size=size@entry=1024) at
>> /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/mmgr/mcxt.c:761
>> #1  0x0000560b7a518ec4 in SPI_connect () at
>> /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/executor/spi.c:102
>> #2  0x00007fec467b9261 in _PG_init () from
>> /usr/lib/postgresql/9.6/lib/multicorn.so
>> #3  0x0000560b7a717cf2 in internal_load_library
>> (libname=libname@entry=0x7ff48208dbf8
>> <error: Cannot access memory at address 0x7ff48208dbf8>)
>>     at
>> /build/postgresql-9.6-5bnRDZ/postgresql-9.6-9.6.3/build/../src/backend/utils/fmgr/dfmgr.c:276
>> #4  0x0000560b7a7188c0 in RestoreLibraryState (start_address=0x7ff48208dbf8
>> <error: Cannot access memory at address 0x7ff48208dbf8>)
>>     at
>
> Rick: Looks like a buglet in multicorn, which seems to expect to be
> called in a valid memory context. Can you reproduce the bug if you use
> multicorn, and then in the same session execute the problematic query?
>
> Robert, was it intentional that we don't have a memory context defined
> at this point?
>

There is already a "Parallel Worker" memory context defined by that
time.  I think the issue is that the multicorn library expects the
Transaction context to be defined by that time.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [BUGS] signal 11 segfaults with parallel workers

From: Tom Lane
Amit Kapila <amit.kapila16@gmail.com> writes:
> There is already a "Parallel Worker" memory context defined by that
> time.  I think the issue is that multicorn library expects that
> Transaction context to be defined by that time.

It looks like multicorn supposes that a library's _PG_init function can
only be called inside a transaction.  That is broken with a capital B.
We need not consider parallel query to find counterexamples: that
means you can't preload multicorn using shared_preload_libraries,
as that loads libraries into the postmaster, which never has and never
will run transactions.

Whatever it's trying to initialize in _PG_init needs to be done later.
        regards, tom lane
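
The failure mode and the suggested direction ("needs to be done later") can be illustrated with a toy model. This is Python rather than the C a real extension would use, and every class and method name here is hypothetical; it only mimics the ordering problem seen in the backtrace, where SPI_connect is reached from _PG_init with a null memory context because no transaction has started yet:

```python
class Backend:
    """Stand-in for a server process; not a real PostgreSQL API."""

    def __init__(self):
        self.in_transaction = False

    def spi_connect(self):
        # Models a call that needs transaction-scoped state to exist.
        if not self.in_transaction:
            raise RuntimeError("no transaction context (the real server segfaults)")
        return "spi-connection"


class EagerExtension:
    """Does transaction-dependent work at load time, like the _PG_init in the trace."""

    def __init__(self, backend):
        self.conn = backend.spi_connect()


class LazyExtension:
    """Defers that work until first use, when a transaction is active."""

    def __init__(self, backend):
        self.backend = backend
        self._conn = None

    def conn(self):
        if self._conn is None:
            self._conn = self.backend.spi_connect()
        return self._conn


worker = Backend()  # a freshly started worker loads its libraries first
try:
    EagerExtension(worker)
    eager_ok = True
except RuntimeError:
    eager_ok = False

ext = LazyExtension(worker)   # loading succeeds without a transaction
worker.in_transaction = True  # later, the worker is inside a transaction
lazy_conn = ext.conn()
print(eager_ok, lazy_conn)  # False spi-connection
```

The same ordering applies to the shared_preload_libraries case mentioned above: the load hook runs in a process that is not inside a transaction, so only the deferred variant survives loading.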


Re: [BUGS] signal 11 segfaults with parallel workers

From: Michael Paquier
On Mon, Jul 31, 2017 at 5:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> There is already a "Parallel Worker" memory context defined by that
>> time.  I think the issue is that multicorn library expects that
>> Transaction context to be defined by that time.
>
> It looks like multicorn supposes that a library's _PG_init function can
> only be called inside a transaction.  That is broken with a capital B.
> We need not consider parallel query to find counterexamples: that
> means you can't preload multicorn using shared_preload_libraries,
> as that loads libraries into the postmaster, which never has and never
> will run transactions.
>
> Whatever it's trying to initialize in _PG_init needs to be done later.

Indeed, that's bad. I am adding in CC Ronan who has been working on
multicorn. At this stage, I think that you would be better off
disabling parallelism.
-- 
Michael


Re: [BUGS] signal 11 segfaults with parallel workers

From: Rick Otten
Just to follow up: the database has not crashed since I disabled parallelism.   As a result of that change, some of my queries are running dramatically slower; I'm still working on getting them back up to reasonable performance.   I look forward to a solution that allows both FDW extensions and parallel queries to coexist in the same database.

On Mon, Jul 31, 2017 at 3:52 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Mon, Jul 31, 2017 at 5:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> There is already a "Parallel Worker" memory context defined by that
>> time.  I think the issue is that multicorn library expects that
>> Transaction context to be defined by that time.
>
> It looks like multicorn supposes that a library's _PG_init function can
> only be called inside a transaction.  That is broken with a capital B.
> We need not consider parallel query to find counterexamples: that
> means you can't preload multicorn using shared_preload_libraries,
> as that loads libraries into the postmaster, which never has and never
> will run transactions.
>
> Whatever it's trying to initialize in _PG_init needs to be done later.

Indeed, that's bad. I am adding Ronan, who has been working on
multicorn, in CC. At this stage, I think that you would be better off
disabling parallelism.
--
Michael

Re: [BUGS] signal 11 segfaults with parallel workers

From
Andres Freund
Date:
Hi,

(please quote "properly" on pg lists, also don't trim the CC list wholesale)

(and sorry for the duplicate email, I somehow managed to break the mail
headers into the body)

On 2017-08-08 15:41:38 -0400, Rick Otten wrote:
> > > Whatever it's trying to initialize in _PG_init needs to be done later.
> >
> > Indeed, that's bad. I am adding Ronan, who has been working on
> > multicorn, in CC. At this stage, I think that you would be better off
> > disabling parallelism.

> Just to follow up: the database has not crashed since I disabled
> parallelism. As a result of that change, some of my queries are running
> dramatically slower, and I'm still working to get them back up to
> reasonable performance. I look forward to a solution that allows
> both FDW extensions and parallel queries to coexist in the same database.

This is going to need a multicorn bugfix. This isn't a postgres bug, and
other FDWs can coexist with parallelism.  Therefore I unfortunately
think we can't really do anything here.

Perhaps, for v11, we should actually make sure there's no memory context
etc set during _PG_init() to catch such problems earlier? It's a bit
nasty to only see them if the shared library is preloaded and/or
parallelism is in use.

Greetings,

Andres Freund
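
The v11 idea above could be modelled roughly as follows. This is a standalone C sketch under stated assumptions, not PostgreSQL code: Context, current_context, init_hook, load_library and the two example hooks are all hypothetical names. The loader hides the ambient state while the load-time hook runs, so a hook that depends on it fails the same way no matter how the library was loaded:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ambient per-process state such as the
 * current memory context. */
typedef struct Context { const char *name; } Context;
static Context *current_context = NULL;

/* Hypothetical signature for a library's load-time hook: returns false
 * if the hook could not initialize. */
typedef bool (*init_hook)(void);

/* Loader that hides the ambient state while the hook runs, then
 * restores it for the caller. */
static bool load_library(init_hook hook)
{
    assert(hook != NULL);
    Context *saved = current_context;
    current_context = NULL;
    bool ok = hook();
    current_context = saved;
    return ok;
}

/* A hook that needs nothing from its caller. */
static bool careful_init(void) { return true; }

/* A hook that silently relies on the ambient context being set. */
static bool careless_init(void) { return current_context != NULL; }
```

In this model careless_init fails even when the caller happens to have a context set, which is the point: the problem shows up on every load path instead of only under preloading or parallelism.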



Re: [BUGS] signal 11 segfaults with parallel workers

From
Michael Paquier
Date:
On Tue, Aug 8, 2017 at 9:51 PM, Andres Freund <andres@anarazel.de> wrote:
> Perhaps, for v11, we should actually make sure there's no memory context
> etc set during _PG_init() to catch such problems earlier? It's a bit
> nasty to only see them if the shared library is preloaded and/or
> parallelism is in use.

Yeah, some prevention like that would be a good idea for module
developers. We could also check for a higher-level thing like being
sure that there is no transaction context?
-- 
Michael



Re: [BUGS] signal 11 segfaults with parallel workers

From
Andres Freund
Date:
On 2017-08-08 21:55:50 +0200, Michael Paquier wrote:
> On Tue, Aug 8, 2017 at 9:51 PM, Andres Freund <andres@anarazel.de> wrote:
> > Perhaps, for v11, we should actually make sure there's no memory context
> > etc set during _PG_init() to catch such problems earlier? It's a bit
> > nasty to only see them if the shared library is preloaded and/or
> > parallelism is in use.
> 
> Yeah, some prevention like that would be a good idea for module
> developers.

> We could also check for a higher-level thing like being sure that
> there is no transaction context?

Not quite sure what you mean by that? And if you just mean to ensure
that _PG_init() is called outside of a transaction - how? We load shared
libraries on demand when they're used - and that'll frequently be in a
transaction?

- Andres



Re: [BUGS] signal 11 segfaults with parallel workers

From
Amit Kapila
Date:
On Wed, Aug 9, 2017 at 1:27 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-08-08 21:55:50 +0200, Michael Paquier wrote:
>> On Tue, Aug 8, 2017 at 9:51 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Perhaps, for v11, we should actually make sure there's no memory context
>> > etc set during _PG_init() to catch such problems earlier? It's a bit
>> > nasty to only see them if the shared library is preloaded and/or
>> > parallelism is in use.
>>
>> Yeah, some prevention like that would be a good idea for module
>> developers.
>
>> We could also check for a higher-level thing like being sure that
>> there is no transaction context?
>
> Not quite sure what you mean by that? And if you just mean to ensure
> that _PG_init() is called outside of a transaction - how? We load shared
> libraries on demand when they're used - and that'll frequently be in a
> transaction?
>

Yes.  I was also thinking along those lines.  Don't you think we
should do RestoreLibraryState in a transaction in the parallel
worker?  Currently, we restore GUCState in a transaction, and I think
LibraryState can also be restored in a transaction (see
ParallelWorkerMain).  I have checked locally and it seems to
work.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

