Thread: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault

[BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault

From: maksim_karaba@epam.com
The following bug has been logged on the website:

Bug reference:      14781
Logged by:          Maksim Karaba
Email address:      maksim_karaba@epam.com
PostgreSQL version: 9.6.4
Operating system:   CentOS Linux release 7.3.1611 (Core)
Description:

Hi!
We are getting the error "server process was terminated by signal 11:
Segmentation fault", and the server goes into recovery mode on our production
system. It happens several times a day, always on complicated updates using
postgres_fdw. Could you suggest a workaround? We are running PostgreSQL 9.6.4.
Error from /var/log/messages:
bigdatadb kernel: postgres[17467]: segfault at 58 ip 00000000005c2f94 sp 00007ffdc06b2260 error 4 in postgres[400000+616000]
bigdatadb abrt-hook-ccpp: Process 17467 (postgres) of user 26 killed by SIGSEGV - dumping core

Backtrace from the core dump:

#0  ExecProcNode (node=0x0) at execProcnode.c:380
        result = <optimized out>
        __func__ = "ExecProcNode"
#1  0x00007f534be36554 in postgresRecheckForeignScan (node=<optimized out>, slot=0x6afbf68) at postgres_fdw.c:2059
        scanrelid = <optimized out>
        outerPlan = <optimized out>
        result = <optimized out>
#2  0x00000000005e3b7c in ForeignRecheck (node=node@entry=0x6afba48, slot=slot@entry=0x6afbf68) at nodeForeignscan.c:101
        fdwroutine = 0x6a48b88
        econtext = 0x6afbb58
#3  0x00000000005ca5a6 in ExecScanFetch (recheckMtd=0x5e3b40 <ForeignRecheck>, accessMtd=0x5e3bb0 <ForeignNext>, node=0x6afba48) at execScan.c:85
        slot = <optimized out>
        scanrelid = <optimized out>
        estate = <optimized out>
#4  ExecScan (node=node@entry=0x6afba48, accessMtd=accessMtd@entry=0x5e3bb0 <ForeignNext>, recheckMtd=recheckMtd@entry=0x5e3b40 <ForeignRecheck>) at execScan.c:180
        econtext = 0x6afbb58
        qual = 0x0
        projInfo = 0x6afc0d0
        isDone = ExprSingleResult
        resultSlot = <optimized out>
#5  0x00000000005e3c5f in ExecForeignScan (node=node@entry=0x6afba48) at nodeForeignscan.c:119
No locals.
#6  0x00000000005c36a8 in ExecProcNode (node=node@entry=0x6afba48) at execProcnode.c:465
        result = <optimized out>
        __func__ = "ExecProcNode"
#7  0x00000000005e0889 in ExecSort (node=node@entry=0x6afb7d8) at nodeSort.c:103
        plannode = <optimized out>
        outerNode = 0x6afba48
        tupDesc = <optimized out>
        estate = 0x6a42dc8
        dir = ForwardScanDirection
        tuplesortstate = 0x6b70868
        slot = <optimized out>
#8  0x00000000005c3648 in ExecProcNode (node=node@entry=0x6afb7d8) at execProcnode.c:495
        result = <optimized out>
        __func__ = "ExecProcNode"
#9  0x00000000005e448e in begin_partition (winstate=winstate@entry=0x6a50b28) at nodeWindowAgg.c:1082
        outerslot = <optimized out>
        outerPlan = 0x6afb7d8
        numfuncs = 1
        i = <optimized out>
#10 0x00000000005e6a4b in ExecWindowAgg (winstate=winstate@entry=0x6a50b28) at nodeWindowAgg.c:1691
        result = <optimized out>
        isDone = ExprSingleResult
        econtext = <optimized out>
        i = <optimized out>
        numfuncs = <optimized out>
        __func__ = "ExecWindowAgg"
#11 0x00000000005c3618 in ExecProcNode (node=0x6a50b28) at execProcnode.c:507

Thank you!


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

maksim_karaba@epam.com writes:
> We are getting error "server process was terminated by signal 11:
> Segmentation fault" and server goes to recovery mode on production system.
> It repeats several times a day, always on complicated updates using
> postgres_fdw.

Please show the failing query or queries.  (If you don't have postmaster
logs showing them, "p debug_query_string" in the core files should get
the info.)  Also, please show EXPLAIN VERBOSE plans for the query(s),
as well as schema information (psql \d output would do) for the
referenced tables.

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems
        regards, tom lane



Thank you for answering so quickly.

>(gdb) p debug_query_string
>$1 = 0x2c2e748 "DO $$DECLARE\nBEGIN\n PERFORM dwh.load_trade_record_inc();\nEND$$;"

We added logging to the function and found that the error always occurs on the same update
(attached).
The EXPLAIN VERBOSE plan is also attached.


MAKSIM KARABA



-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, August 16, 2017 4:47 PM
To: Maksim Karaba <Maksim_Karaba@epam.com>
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault

maksim_karaba@epam.com writes:
> We are getting error "server process was terminated by signal 11:
> Segmentation fault" and server goes to recovery mode on production system.
> It repeats several times a day, always on complicated updates using
> postgres_fdw.

Please show the failing query or queries.  (If you don't have postmaster logs showing them, "p debug_query_string" in
the core files should get the info.)  Also, please show EXPLAIN VERBOSE plans for the query(s), as well as schema
information (psql \d output would do) for the referenced tables.

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

            regards, tom lane


Maksim Karaba <Maksim_Karaba@epam.com> writes:
> We added logging to the function and found out that the error always occurs on the same update
> (in attachments)
> Plan explain verbose in attachments

The point of my questions was that it's going to be difficult for anyone
to fix this unless we can reproduce the problem.  What you've provided
doesn't even begin to make that possible.  Please see the advice about
providing self-contained test cases at
https://www.postgresql.org/docs/current/static/bug-reporting.html
        regards, tom lane



Unfortunately, we cannot reproduce this issue on other servers, only on the production system.
We also cannot share internal database details such as the schema structure or table information.




-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, August 16, 2017 6:10 PM
To: Maksim Karaba <Maksim_Karaba@epam.com>
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault


The point of my questions was that it's going to be difficult for anyone to fix this unless we can reproduce the
problem. What you've provided doesn't even begin to make that possible.  Please see the advice about providing
self-contained test cases at https://www.postgresql.org/docs/current/static/bug-reporting.html
        regards, tom lane



Maksim Karaba <Maksim_Karaba@epam.com> writes:
> Unfortunately we cannot reproduce this issue on other servers, only on production system.
> And we cannot provide internal database info, schema structure and tables info.

[ shrug... ]  We may just have to wait for somebody to be more
forthcoming.

FWIW, the stack trace seems to indicate that an incorrect plan has been
generated, ie one that has a remote join node without an EPQ recheck
subplan.  That mistake in itself is probably pretty deterministic.  The
reason you can't reproduce the crash easily is that the lack of a subplan
only manifests as a crash if we enter the EPQ recheck code, and that only
happens if the query tries to update a row that's just been updated by
some concurrent query.  So it's not going to crash except under concurrent
load, which probably also explains why the bug wasn't found long ago.

If you want to push this forward rather than wait for somebody else
to hit the problem, you could try adding something like

    if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
        (estate->es_plannedstmt->commandType != CMD_SELECT ||
         estate->es_rowMarks))
        elog(WARNING, "foreign join plan lacks EPQ support");

near the beginning of postgresBeginForeignScan and then running your app
on a test server.  I'm not sure offhand that the estate filters are
exactly right, but any statement that produces this warning would be
pretty suspect.  At that point you could work on sanitizing the query +
tables + test data to get to a publishable test case; you could probably
boil your real query down quite a bit and still get the failure.
        regards, tom lane
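
For readers following along, the condition in the suggested check can be read as a small predicate. What follows is only a rough, self-contained model of it, not PostgreSQL source: ForeignScanModel, ModelCmdType, may_need_epq and lacks_epq_support are names made up for illustration, and the filter on statement type carries the same caveat given above, namely that it may not be exactly right.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the statement type the real check reads
 * from estate->es_plannedstmt->commandType. */
typedef enum
{
    MODEL_CMD_SELECT,
    MODEL_CMD_UPDATE
} ModelCmdType;

/* Hypothetical stand-in for the pieces of executor state the check uses. */
typedef struct ForeignScanModel
{
    unsigned     scanrelid;     /* 0 means the scan stands for a remote join */
    const void  *outer_plan;    /* local EPQ recheck subplan; NULL if absent */
    ModelCmdType command_type;  /* kind of statement being executed */
    bool         has_row_marks; /* true for e.g. SELECT ... FOR UPDATE */
} ForeignScanModel;

/* A statement can reach the EPQ recheck path only if it modifies rows
 * or locks them; a plain SELECT never does. */
bool
may_need_epq(const ForeignScanModel *fs)
{
    return fs->command_type != MODEL_CMD_SELECT || fs->has_row_marks;
}

/* Mirrors the suggested WARNING condition: a remote join with no local
 * recheck subplan, in a statement that can enter EPQ. */
bool
lacks_epq_support(const ForeignScanModel *fs)
{
    return fs->scanrelid == 0 &&
           fs->outer_plan == NULL &&
           may_need_epq(fs);
}
```

In this model an UPDATE whose remote join has no recheck subplan is flagged, while the same plan shape under a plain SELECT without row marks is not, which matches the point that the missing subplan only matters once a concurrent update forces a recheck.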



Tom Lane wrote:
> Maksim Karaba <Maksim_Karaba@epam.com> writes:
> > Unfortunately we cannot reproduce this issue on other servers, only on production system.
> > And we cannot provide internal database info, schema structure and tables info.
> 
> [ shrug... ]  We may just have to wait for somebody to be more
> forthcoming.
> 
> FWIW, the stack trace seems to indicate that an incorrect plan has been
> generated, ie one that has a remote join node without an EPQ recheck
> subplan.  That mistake in itself is probably pretty deterministic.  The
> reason you can't reproduce the crash easily is that the lack of a subplan
> only manifests as a crash if we enter the EPQ recheck code, and that only
> happens if the query tries to update a row that's just been updated by
> some concurrent query.  So it's not going to crash except under concurrent
> load, which probably also explains why the bug wasn't found long ago.

One way to figure out the exact bug is to explore the sequence of WAL
records that leads to the tuple causing the crash; it should be possible
to create a reproducer by writing an isolationtester script that
produces the same WAL sequence.  That's how we found the bug fixed in
https://git.postgresql.org/pg/commitdiff/459c64d3227f8 for example.

> If you want to push this forward rather than wait for somebody else
> to hit the problem, you could try adding something like
> 
>     if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
>         (estate->es_plannedstmt->commandType != CMD_SELECT ||
>          estate->es_rowMarks))
>         elog(WARNING, "foreign join plan lacks EPQ support");
> 
> near the beginning of postgresBeginForeignScan and then running your app
> on a test server.

Hmm, is there a reason this cannot be included as a sanity check always?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> If you want to push this forward rather than wait for somebody else
>> to hit the problem, you could try adding something like
>> 
>>     if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
>>         (estate->es_plannedstmt->commandType != CMD_SELECT ||
>>          estate->es_rowMarks))
>>         elog(WARNING, "foreign join plan lacks EPQ support");
>> 
>> near the beginning of postgresBeginForeignScan and then running your app
>> on a test server.

> Hmm, is there a reason this cannot be included as a sanity check always?

That's off-the-cuff rather than something I'm sure is correct.  But
yeah, I was wondering about pushing something like that into the
standard code.
        regards, tom lane



Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault

From: Maksim Karaba
Thanks for pointing to the possible root cause.
Our dev team has finally figured out the cause and fixed it.
The problem came from using postgres_fdw-based cursors like this:

for scr in (select f1, f2 ... from foreign table (postgres_fdw) ) loop
    for scr1 in (select .. from foreign table (postgres_fdw) where ... = scr.f1) loop

        Complicated update of local table using foreign tables as source

        update of foreign table one record /**/
    end loop;
end loop;

The dev team has refactored it to loop over arrays instead of FDW cursors, to shorten the foreign session.


MAKSIM KARABA


-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, August 16, 2017 7:57 PM
To: Alvaro Herrera <alvherre@2ndquadrant.com>
Cc: Maksim Karaba <Maksim_Karaba@epam.com>; pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #14781: server process was terminated by signal 11: Segmentation fault

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> If you want to push this forward rather than wait for somebody else
>> to hit the problem, you could try adding something like
>>
>>     if (fsplan->scan.scanrelid == 0 && outerPlanState(node) == NULL &&
>>         (estate->es_plannedstmt->commandType != CMD_SELECT ||
>>          estate->es_rowMarks))
>>         elog(WARNING, "foreign join plan lacks EPQ support");
>>
>> near the beginning of postgresBeginForeignScan and then running your
>> app on a test server.

> Hmm, is there a reason this cannot be included as a sanity check always?

That's off-the-cuff rather than something I'm sure is correct.  But yeah, I was wondering about pushing something like
that into the standard code.
        regards, tom lane

