Thread: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault
[BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault
From
maksim_karaba@epam.com
Date:
The following bug has been logged on the website:

Bug reference: 14781
Logged by: Maksim Karaba
Email address: maksim_karaba@epam.com
PostgreSQL version: 9.6.4
Operating system: CentOS Linux release 7.3.1611 (Core)
Description:

Hi!

We are getting the error "server process was terminated by signal 11: Segmentation fault", and the server goes into recovery mode on our production system. It repeats several times a day, always on complicated updates using postgres_fdw. Could you provide any workaround?

Version: PostgreSQL 9.6.4

Error from /var/log/messages:

bigdatadb kernel: postgres[17467]: segfault at 58 ip 00000000005c2f94 sp 00007ffdc06b2260 error 4 in postgres[400000+616000]
bigdatadb abrt-hook-ccpp: Process 17467 (postgres) of user 26 killed by SIGSEGV - dumping core

And we have the backtrace from the core dump:

#0  ExecProcNode (node=0x0) at execProcnode.c:380
        result = <optimized out>
        __func__ = "ExecProcNode"
#1  0x00007f534be36554 in postgresRecheckForeignScan (node=<optimized out>, slot=0x6afbf68) at postgres_fdw.c:2059
        scanrelid = <optimized out>
        outerPlan = <optimized out>
        result = <optimized out>
#2  0x00000000005e3b7c in ForeignRecheck (node=node@entry=0x6afba48, slot=slot@entry=0x6afbf68) at nodeForeignscan.c:101
        fdwroutine = 0x6a48b88
        econtext = 0x6afbb58
#3  0x00000000005ca5a6 in ExecScanFetch (recheckMtd=0x5e3b40 <ForeignRecheck>, accessMtd=0x5e3bb0 <ForeignNext>, node=0x6afba48) at execScan.c:85
        slot = <optimized out>
        scanrelid = <optimized out>
        estate = <optimized out>
#4  ExecScan (node=node@entry=0x6afba48, accessMtd=accessMtd@entry=0x5e3bb0 <ForeignNext>, recheckMtd=recheckMtd@entry=0x5e3b40 <ForeignRecheck>) at execScan.c:180
        econtext = 0x6afbb58
        qual = 0x0
        projInfo = 0x6afc0d0
        isDone = ExprSingleResult
        resultSlot = <optimized out>
#5  0x00000000005e3c5f in ExecForeignScan (node=node@entry=0x6afba48) at nodeForeignscan.c:119
        No locals.
#6  0x00000000005c36a8 in ExecProcNode (node=node@entry=0x6afba48) at execProcnode.c:465
        result = <optimized out>
        __func__ = "ExecProcNode"
#7  0x00000000005e0889 in ExecSort (node=node@entry=0x6afb7d8) at nodeSort.c:103
        plannode = <optimized out>
        outerNode = 0x6afba48
        tupDesc = <optimized out>
        estate = 0x6a42dc8
        dir = ForwardScanDirection
        tuplesortstate = 0x6b70868
        slot = <optimized out>
#8  0x00000000005c3648 in ExecProcNode (node=node@entry=0x6afb7d8) at execProcnode.c:495
        result = <optimized out>
        __func__ = "ExecProcNode"
#9  0x00000000005e448e in begin_partition (winstate=winstate@entry=0x6a50b28) at nodeWindowAgg.c:1082
        outerslot = <optimized out>
        outerPlan = 0x6afb7d8
        numfuncs = 1
        i = <optimized out>
#10 0x00000000005e6a4b in ExecWindowAgg (winstate=winstate@entry=0x6a50b28) at nodeWindowAgg.c:1691
        result = <optimized out>
        isDone = ExprSingleResult
        econtext = <optimized out>
        i = <optimized out>
        numfuncs = <optimized out>
        __func__ = "ExecWindowAgg"
#11 0x00000000005c3618 in ExecProcNode (node=0x6a50b28) at execProcnode.c:507

Thank you!

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
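One detail in the kernel log lines up with the backtrace: the fault address is 0x58, while frame #0 shows node=0x0. That pattern is what reading a structure member through a NULL pointer tends to look like, since the address touched equals the member's offset. The sketch below is a hypothetical stand-in for the header of the executor's PlanState struct. The field names and their order are reproduced from memory of the 9.6 sources, not taken from this thread, so treat the layout as an assumption to be checked against execnodes.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mock of the leading fields of PlanState.
 * ASSUMPTION: field order resembles PostgreSQL 9.6; verify against the
 * real headers before relying on any offset computed here.
 */
typedef struct MockPlanState
{
    int         type;               /* node tag */
    void       *plan;
    void       *state;
    void       *instrument;
    void       *worker_instrument;
    void       *targetlist;
    void       *qual;
    struct MockPlanState *lefttree; /* the "outer" child plan state */
    struct MockPlanState *righttree;
    void       *initPlan;
    void       *subPlan;
    void       *chgParam;
} MockPlanState;

/*
 * Address that would be read when evaluating node->chgParam for a node
 * located at node_address. With node_address == 0 this is just the offset.
 */
uintptr_t
chgparam_read_address(uintptr_t node_address)
{
    return node_address + offsetof(MockPlanState, chgParam);
}
```

Under this assumed layout, on a typical 64-bit build (8-byte pointers, 4-byte int padded to 8), chgparam_read_address(0) evaluates to 0x58, which would be consistent with the "segfault at 58" message if the first thing ExecProcNode does with its argument is read that member.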
Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault
From
Tom Lane
Date:
maksim_karaba@epam.com writes:
> We are getting error "server process was terminated by signal 11:
> Segmentation fault" and server goes to recovery mode on production system.
> It repeats several times a day, always on complicated updates using
> postgres_fdw.

Please show the failing query or queries. (If you don't have postmaster logs showing them, "p debug_query_string" in the core files should get the info.) Also, please show EXPLAIN VERBOSE plans for the query(s), as well as schema information (psql \d output would do) for the referenced tables.

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

regards, tom lane
Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault
From
Maksim Karaba
Date:
Thank you for answering so quickly.

    (gdb) p debug_query_string
    $1 = 0x2c2e748 "DO $$DECLARE\nBEGIN\n PERFORM dwh.load_trade_record_inc();\nEND$$;"

We added logging to the function and found that the error always occurs on the same UPDATE (attached). The EXPLAIN VERBOSE plan is attached as well.

MAKSIM KARABA
Senior Systems Engineer, EPAM
Attachment
Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault
From
Tom Lane
Date:
Maksim Karaba <Maksim_Karaba@epam.com> writes:
> We added logging to the function and found out that the error always occurs on the same update
> (in attachments)
> Plan explain verbose in attachments

The point of my questions was that it's going to be difficult for anyone to fix this unless we can reproduce the problem. What you've provided doesn't even begin to make that possible. Please see the advice about providing self-contained test cases at

https://www.postgresql.org/docs/current/static/bug-reporting.html

regards, tom lane
Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault
From
Maksim Karaba
Date:
Unfortunately, we cannot reproduce this issue on other servers, only on the production system. We also cannot provide internal database information such as the schema structure or table details.
Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault
From
Tom Lane
Date:
Maksim Karaba <Maksim_Karaba@epam.com> writes:
> Unfortunately we cannot reproduce this issue on other servers, only on production system.
> And we cannot provide internal database info, schema structure and tables info.

[ shrug... ] We may just have to wait for somebody to be more forthcoming.

FWIW, the stack trace seems to indicate that an incorrect plan has been generated, ie one that has a remote join node without an EPQ recheck subplan. That mistake in itself is probably pretty deterministic. The reason you can't reproduce the crash easily is that the lack of a subplan only manifests as a crash if we enter the EPQ recheck code, and that only happens if the query tries to update a row that's just been updated by some concurrent query. So it's not going to crash except under concurrent load, which probably also explains why the bug wasn't found long ago.

If you want to push this forward rather than wait for somebody else to hit the problem, you could try adding something like

    if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
        (estate->es_plannedstmt->commandType != CMD_SELECT ||
         estate->es_rowMarks))
        elog(WARNING, "foreign join plan lacks EPQ support");

near the beginning of postgresBeginForeignScan and then running your app on a test server. I'm not sure offhand that the estate filters are exactly right, but any statement that produces this warning would be pretty suspect. At that point you could work on sanitizing the query + tables + test data to get to a publishable test case; you could probably boil your real query down quite a bit and still get the failure.

regards, tom lane
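The reasoning in this message can be modelled outside the server. The following is a simplified sketch that does not use PostgreSQL headers: the Mock* types and the helpers lacks_epq_support() and recheck_would_crash() are hypothetical names invented for illustration, and only the condition inside lacks_epq_support() mirrors the snippet suggested above. It tries to show that the plan defect is visible as soon as the scan starts, while the crash additionally needs the EPQ recheck path to be entered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real PostgreSQL structs. */
typedef enum
{
    MOCK_CMD_SELECT,
    MOCK_CMD_UPDATE,
    MOCK_CMD_DELETE
} MockCmdType;

typedef struct MockForeignScan
{
    unsigned int scanrelid;     /* 0: node represents a remote join */
    const void  *outer_plan;    /* local subplan for EPQ rechecks, or NULL */
    MockCmdType  command_type;  /* type of the enclosing statement */
    bool         has_row_marks; /* e.g. SELECT ... FOR UPDATE */
} MockForeignScan;

/*
 * Mirrors the suggested WARNING condition: a remote join with no local
 * subplan, in a statement that could need an EPQ recheck.
 */
bool
lacks_epq_support(const MockForeignScan *scan)
{
    assert(scan != NULL);
    return scan->scanrelid == 0 &&
           scan->outer_plan == NULL &&
           (scan->command_type != MOCK_CMD_SELECT || scan->has_row_marks);
}

/*
 * Models when the defect turns into a crash. ASSUMPTION: the recheck only
 * runs the subplan for join nodes (scanrelid == 0) and is only reached
 * after a concurrent update of the target row.
 */
bool
recheck_would_crash(const MockForeignScan *scan, bool row_concurrently_updated)
{
    assert(scan != NULL);
    if (!row_concurrently_updated)
        return false;           /* EPQ recheck is never entered */
    if (scan->scanrelid > 0)
        return false;           /* base relation: no subplan involved */
    return scan->outer_plan == NULL;
}
```

For a mock UPDATE whose join node has no subplan, lacks_epq_support() is true regardless of load, whereas recheck_would_crash() is true only when the concurrent-update flag is set. That split is the reason a warning at scan start could flag the suspect statement on a quiet test server.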
Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault
From
Alvaro Herrera
Date:
Tom Lane wrote:
> Maksim Karaba <Maksim_Karaba@epam.com> writes:
> > Unfortunately we cannot reproduce this issue on other servers, only on production system.
> > And we cannot provide internal database info, schema structure and tables info.
>
> [ shrug... ] We may just have to wait for somebody to be more
> forthcoming.
>
> FWIW, the stack trace seems to indicate that an incorrect plan has been
> generated, ie one that has a remote join node without an EPQ recheck
> subplan. That mistake in itself is probably pretty deterministic. The
> reason you can't reproduce the crash easily is that the lack of a subplan
> only manifests as a crash if we enter the EPQ recheck code, and that only
> happens if the query tries to update a row that's just been updated by
> some concurrent query. So it's not going to crash except under concurrent
> load, which probably also explains why the bug wasn't found long ago.

One way to figure out the exact bug is to explore the sequence of WAL records that leads to the tuple causing the crash; it should be possible to create a reproducer by writing an isolationtester script that produces the same WAL sequence. That's how we found the bug fixed in https://git.postgresql.org/pg/commitdiff/459c64d3227f8 for example.

> If you want to push this forward rather than wait for somebody else
> to hit the problem, you could try adding something like
>
>     if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
>         (estate->es_plannedstmt->commandType != CMD_SELECT ||
>          estate->es_rowMarks))
>         elog(WARNING, "foreign join plan lacks EPQ support");
>
> near the beginning of postgresBeginForeignScan and then running your app
> on a test server.

Hmm, is there a reason this cannot be included as a sanity check always?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault
From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> If you want to push this forward rather than wait for somebody else
>> to hit the problem, you could try adding something like
>>
>>     if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
>>         (estate->es_plannedstmt->commandType != CMD_SELECT ||
>>          estate->es_rowMarks))
>>         elog(WARNING, "foreign join plan lacks EPQ support");
>>
>> near the beginning of postgresBeginForeignScan and then running your app
>> on a test server.

> Hmm, is there a reason this cannot be included as a sanity check always?

That's off-the-cuff rather than something I'm sure is correct. But yeah, I was wondering about pushing something like that into the standard code.

regards, tom lane
Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault
From
Maksim Karaba
Date:
Thanks for pointing to the possible root cause. Our dev team has finally figured out the cause and fixed it. The problem came from using postgres_fdw-based cursor loops like this:

    for scr in (select f1, f2 ... from foreign table (postgres_fdw)) loop
        for scr1 in (select .. from foreign table (postgres_fdw) where ... = scr.f1) loop
            -- complicated update of a local table using foreign tables as the source
            -- update of one record in a foreign table
            /**/
        end loop;
    end loop;

The dev team has refactored it to loop over arrays instead of FDW-based cursors, to reduce the duration of the foreign session.

MAKSIM KARABA