Re: BUG #14420: Parallel worker segfault - Mailing list pgsql-bugs

From Rick Otten
Subject Re: BUG #14420: Parallel worker segfault
Date
Msg-id 8029759f8564c4960af1d15a544a8826@www.windfish.net
In response to Re: BUG #14420: Parallel worker segfault  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-bugs
Sorry about forgetting to CC the bugs list when I replied.

I've enabled "-c", and made sure my PGDATA directory has enough space to
collect a full core image. If we get one, I'll let you know.
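To spell out what "enabled -c" means here, for the archives: pg_ctl
takes a -c/--core-files flag that lifts the soft core-size limit so a
crashing backend can dump a core file into PGDATA. Roughly this, with
the paths being illustrative:

    # lift the shell's core-size limit, then restart with core files allowed
    ulimit -c unlimited
    pg_ctl restart -D "$PGDATA" -c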

There were a lot of queries running at the time of the segfault. The
only new or unusual one that I'm aware of was doing a UNION ALL
between two nearly identical queries, where one side was using a
parallel scan and the other side wasn't. I had just refactored it
from an "OR" condition in the WHERE clause to the UNION ALL form
because it was much faster that way.
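Schematically, the refactor was of this shape (table and column names
are made up for illustration, not the real query):

    -- before: one scan with an OR in the WHERE clause
    SELECT * FROM big_table WHERE col_a = 1 OR col_b = 2;

    -- after: two nearly identical queries glued with UNION ALL;
    -- the second arm excludes rows the first arm already returned
    SELECT * FROM big_table WHERE col_a = 1
    UNION ALL
    SELECT * FROM big_table WHERE col_b = 2
      AND col_a IS DISTINCT FROM 1;

The planner picked a parallel plan for one arm but not the other, which
is the mix that was running when the crash happened.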

Since Friday I've run another 1M or so of those queries, but it hasn't
segfaulted again.

On 2016-11-12 02:18, Amit Kapila wrote:

> On Sat, Nov 12, 2016 at 9:32 AM, Rick Otten <rotten@windfish.net> wrote:
>
> Please keep pgsql-bugs in the loop. It is important to keep everyone
> in the loop, not only because that is how this community works,
> but also because others may see something that you or I can't.
>
>> PostgreSQL was not started with the "-c" option. I'll look into enabling that before this happens again.
>
> Makes sense.
>
>> I'll read more from the other debugging article and see if there is
>> anything I can do there as well. Thanks. There were no files generated
>> and dropped in PGDATA this time, unfortunately. Sorry, I know this
>> isn't much to go on, but it is all I know at this time. There wasn't
>> much else that wasn't routine in the logs before or after the two lines
>> I pasted below, other than a bunch of warnings for the 30 or 40
>> transactions that were in progress, followed by this:
> Okay, I think we can't get anything from these logs. Once a
> core file is available, we can try to find the reason, but it would be
> much better if we can come up with an independent test that reproduces
> the problem. One possible way is to find the culprit query. You
> might want to log long-running queries, as parallelism will generally
> be used for such queries.
>
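A minimal way to do that (the threshold here is illustrative, and
ALTER SYSTEM needs superuser):

    -- log every statement that runs longer than one second
    ALTER SYSTEM SET log_min_duration_statement = '1s';
    SELECT pg_reload_conf();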
>> 2016-11-11 21:31:26.292 UTC WARNING: terminating connection because of crash of another server process
>> 2016-11-11 21:31:26.292 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>> 2016-11-11 21:31:26.292 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.
>> 2016-11-11 21:31:26.301 UTC WARNING: terminating connection because of crash of another server process
>> 2016-11-11 21:31:26.301 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>> 2016-11-11 21:31:26.301 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.
>> 2016-11-11 21:31:30.762 UTC [unknown] x.x.x.x [unknown] LOG: connection received: host=x.x.x.x port=47692
>> 2016-11-11 21:31:30.762 UTC clarivoy x.x.x.x some_user FATAL: the database system is in recovery mode
>> 2016-11-11 21:31:31.766 UTC LOG: all server processes terminated; reinitializing
>> 2016-11-11 21:31:33.526 UTC LOG: database system was interrupted; last known up at 2016-11-11 21:29:28 UTC
>> 2016-11-11 21:31:33.660 UTC LOG: database system was not properly shut down; automatic recovery in progress
>> 2016-11-11 21:31:33.674 UTC LOG: redo starts at 1DD/4F5A0320
>> 2016-11-11 21:31:33.957 UTC LOG: unexpected pageaddr 1DC/16AEC000 in log segment 00000001000001DD00000056, offset 11452416
>> 2016-11-11 21:31:33.958 UTC LOG: redo done at 1DD/56AEB7F8
>> 2016-11-11 21:31:33.958 UTC LOG: last completed transaction was at log time 2016-11-11 21:31:26.07448+00
>> 2016-11-11 21:31:34.705 UTC LOG: MultiXact member wraparound protections are now enabled
>> 2016-11-11 21:31:34.724 UTC LOG: autovacuum launcher started
>> 2016-11-11 21:31:34.725 UTC LOG: database system is ready to accept connections
>>
>> After that the database was pretty much back to normal. Because
>> everything connects from various pgbouncer instances running elsewhere,
>> they quickly reconnected and started working again without having to
>> restart any applications or services.
