Thread: BUG #14420: Parallel worker segfault

BUG #14420: Parallel worker segfault

From
rotten@windfish.net
Date:
The following bug has been logged on the website:

Bug reference:      14420
Logged by:          Rick Otten
Email address:      rotten@windfish.net
PostgreSQL version: 9.6.1
Operating system:   Ubuntu 16.04
Description:

This just happened on my production 9.6.1 system:

2016-11-11 21:31:26.215 UTC    LOG:  worker process: parallel worker for PID
1737 (PID 6792) was terminated by signal 11: Segmentation fault
2016-11-11 21:31:26.215 UTC    LOG:  terminating any other active server
processes

I had a lot of stuff running concurrently, so I'm not sure which query
caused the failure.  There is nothing else obvious or illuminating in the
logs that I can see.  After it recovered, everything went back to normal.

I figured I should report this even if I don't have much to go on yet.

Is there a core file somewhere that would be of help?  (I'm not seeing
one.)

Should I turn up debug or enable something to catch more information if it
happens again?

select version();
                                                version
---------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
5.3.1-14ubuntu2) 5.3.1 20160413, 64-bit

Re: BUG #14420: Parallel worker segfault

From
Amit Kapila
Date:
On Sat, Nov 12, 2016 at 3:36 AM,  <rotten@windfish.net> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      14420
> Logged by:          Rick Otten
> Email address:      rotten@windfish.net
> PostgreSQL version: 9.6.1
> Operating system:   Ubuntu 16.04
> Description:
>
> This just happened on my production 9.6.1 system:
>
> 2016-11-11 21:31:26.215 UTC    LOG:  worker process: parallel worker for PID
> 1737 (PID 6792) was terminated by signal 11: Segmentation fault
> 2016-11-11 21:31:26.215 UTC    LOG:  terminating any other active server
> processes
>
> I had a lot of stuff running concurrently, so I'm not sure which query
> caused the failure.  There is nothing else obvious or illuminating in the
> logs that I can see.  After it recovered, everything went back to normal.
>

It is difficult to find the problem from the above information alone.
Can you share the server log?  I am not sure whether we can find
anything in it, but at least we can try.

> I figured I should report this even if I don't have much to go on yet.
>

Thanks for the report.

> Is there a core file somewhere that would be of help?  (I'm not seeing
> one.)
>

If a core file was generated, it should be in the $PGDATA directory.  I
have found a few articles [1][2] on the web related to this which could
be useful to you.
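
If you are not sure where the data directory is, the server itself can
tell you, for example:

-- shows the server's data directory, which is where a core file would
-- normally land
SHOW data_directory;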

> Should I turn up debug or enable something to catch more information if it
> happens again?
>

I think a core file can help here.


[1] - https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
[2] - http://bajis-postgres.blogspot.in/2014/03/want-to-ensure-that-postgres-generates.html

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: BUG #14420: Parallel worker segfault

From
Amit Kapila
Date:
On Sat, Nov 12, 2016 at 9:32 AM, Rick Otten <rotten@windfish.net> wrote:

Please keep pgsql-bugs in the loop.  It is important to keep everyone
in the loop, not only because that is how this community works, but
also because others may see something that you or I can't.

> PostgreSQL was not started with the "-c" option.  I'll look into enabling
> that before this happens again.
>

makes sense.

> I'll read more from the other debugging article and see if there is anything
> I can do there as well.  Thanks.
>
> There were no files generated and dropped in PGDATA this time,
> unfortunately.
>
> Sorry, I know this isn't much to go on, but it is all I know at this time.
>
> There wasn't much else that wasn't routine in the logs before or after the
> two lines I pasted below other than a bunch of warnings for the 30 or 40
> transactions that were in progress, followed by this:
>

Okay, I don't think we can get anything from these logs.  Once a core
file is available, we can try to find the reason, but it would be much
better if we could put together an independent test that reproduces the
problem.  One possible way would be to find the culprit query.  You
might want to log long-running queries, as parallelism will generally
be used for such queries.
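
For example, something along these lines will log any statement that
runs for more than a second (the one-second threshold is just an
illustration; pick whatever fits your workload):

-- log any statement that runs longer than 1000 ms
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();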

> 2016-11-11 21:31:26.292 UTC WARNING: terminating connection because of crash
> of another server process
> 2016-11-11 21:31:26.292 UTC DETAIL: The postmaster has commanded this server
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> 2016-11-11 21:31:26.292 UTC HINT: In a moment you should be able to
> reconnect to the database and repeat your command.
> 2016-11-11 21:31:26.301 UTC WARNING: terminating connection because of crash
> of another server process
> 2016-11-11 21:31:26.301 UTC DETAIL: The postmaster has commanded this server
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> 2016-11-11 21:31:26.301 UTC HINT: In a moment you should be able to
> reconnect to the database and repeat your command.
> 2016-11-11 21:31:30.762 UTC [unknown] x.x.x.x [unknown] LOG: connection
> received: host=x.x.x.x port=47692
> 2016-11-11 21:31:30.762 UTC clarivoy x.x.x.x some_user FATAL: the database
> system is in recovery mode
> 2016-11-11 21:31:31.766 UTC LOG: all server processes terminated;
> reinitializing
> 2016-11-11 21:31:33.526 UTC LOG: database system was interrupted; last known
> up at 2016-11-11 21:29:28 UTC
> 2016-11-11 21:31:33.660 UTC LOG: database system was not properly shut down;
> automatic recovery in progress
> 2016-11-11 21:31:33.674 UTC LOG: redo starts at 1DD/4F5A0320
> 2016-11-11 21:31:33.957 UTC LOG: unexpected pageaddr 1DC/16AEC000 in log
> segment 00000001000001DD00000056, offset 11452416
> 2016-11-11 21:31:33.958 UTC LOG: redo done at 1DD/56AEB7F8
> 2016-11-11 21:31:33.958 UTC LOG: last completed transaction was at log time
> 2016-11-11 21:31:26.07448+00
> 2016-11-11 21:31:34.705 UTC LOG: MultiXact member wraparound protections are
> now enabled
> 2016-11-11 21:31:34.724 UTC LOG: autovacuum launcher started
> 2016-11-11 21:31:34.725 UTC LOG: database system is ready to accept
> connections
>
> After that the database was pretty much back to normal.  Because everything
> connects from various pgbouncer instances running elsewhere, they quickly
> reconnected and started working again without having to restart any
> applications or services.
>



--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: BUG #14420: Parallel worker segfault

From
Rick Otten
Date:
Sorry about forgetting to CC the bugs list when I replied.

I've enabled "-c" and made sure my PGDATA directory has enough space to
collect a full core image.  If we get one, I'll let you know.

There were a lot of queries happening at the time of the segfault.  The
only new or unusual one that I am aware of was doing a UNION ALL between
two nearly identical queries, where one side was doing a parallel scan
and the other side wasn't.  I had just refactored it from using an "OR"
condition in the WHERE clause to the UNION ALL form because it was much
faster that way.
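
Roughly, the shape was like this (the table and column names here are
made up just to illustrate the structure, not the real query):

-- illustrative sketch only; "events", "account_id", and "legacy_account_id"
-- are invented names
SELECT * FROM events WHERE account_id = 42
UNION ALL
SELECT * FROM events WHERE legacy_account_id = 42;

-- previously written as a single WHERE ... OR ... condition:
-- SELECT * FROM events WHERE account_id = 42 OR legacy_account_id = 42;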

Since Friday I have run another 1M or so of those queries, but it hasn't
segfaulted again.

On 2016-11-12 02:18, Amit Kapila wrote:

> On Sat, Nov 12, 2016 at 9:32 AM, Rick Otten <rotten@windfish.net> wrote:
>
> Please keep pgsql-bugs in the loop. It is important to keep everyone
> in the loop not only because it is a way to work in this community,
> but also because others can see something which I or you can't see.
>
>> PostgreSQL was not started with the "-c" option. I'll look into enabling that before this happens again.
>
> makes sense.
>
>> I'll read more from the other debugging article and see if there is
>> anything I can do there as well.  Thanks.
>>
>> There were no files generated and dropped in PGDATA this time,
>> unfortunately.
>>
>> Sorry, I know this isn't much to go on, but it is all I know at this time.
>>
>> There wasn't much else that wasn't routine in the logs before or after
>> the two lines I pasted below other than a bunch of warnings for the 30
>> or 40 transactions that were in progress, followed by this:
>
> Okay, I don't think we can get anything from these logs.  Once a core
> file is available, we can try to find the reason, but it would be much
> better if we could put together an independent test that reproduces the
> problem.  One possible way would be to find the culprit query.  You
> might want to log long-running queries, as parallelism will generally
> be used for such queries.