Thread: Segmentation Fault in logical decoding get/peek API

Segmentation Fault in logical decoding get/peek API

From
Sudalai
Date:
Hi everyone, 

A PostgreSQL backend is crashing with a segmentation fault (signal 11) while
consuming the changes of a big transaction using
pg_logical_slot_(peek|get)_changes.  We are using PostgreSQL version 9.5.10.

About the big transaction: pg_xlogdump shows 161426 records, and it is spread
across 3 WAL files.

Backtrace of the backend:

Program received signal SIGSEGV, Segmentation fault.
ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
/home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380
1380                        change->action = REORDER_BUFFER_CHANGE_INSERT;

#0  ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
/home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380
#1  0x0000000000675cd1 in DecodeCommit (ctx=0xcffdb8, record=<value
optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/decode.c:549
#2  DecodeXactOp (ctx=0xcffdb8, record=<value optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/decode.c:234
#3  LogicalDecodingProcessRecord (ctx=0xcffdb8, record=<value optimized
out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/decode.c:111
#4  0x00000000006772e5 in pg_logical_slot_get_changes_guts
(fcinfo=0x7ffea1ef0b90, confirm=0 '\000', binary=0 '\000') at
/home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/logicalfuncs.c:442
#5  0x00000000005c4074 in ExecMakeTableFunctionResult (funcexpr=0xce55a0,
econtext=0xce5190, argContext=<value optimized out>, expectedDesc=0xce6360,
randomAccess=0 '\000') at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execQual.c:2189
#6  0x00000000005d8652 in FunctionNext (node=0xce5080) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/nodeFunctionscan.c:95
#7  0x00000000005c4d73 in ExecScanFetch (node=0xce5080, accessMtd=0x5d83a0
<FunctionNext>, recheckMtd=0x5d7cd0 <FunctionRecheck>) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execScan.c:95
#8  ExecScan (node=0xce5080, accessMtd=0x5d83a0 <FunctionNext>,
recheckMtd=0x5d7cd0 <FunctionRecheck>) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execScan.c:145
#9  0x00000000005bd808 in ExecProcNode (node=0xce5080) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execProcnode.c:442
#10 0x00000000005caca6 in fetch_input_tuple (aggstate=0xce4808) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/nodeAgg.c:487
#11 0x00000000005cc286 in agg_retrieve_direct (node=<value optimized out>)
at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/nodeAgg.c:1683
#12 ExecAgg (node=<value optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/nodeAgg.c:1508
#13 0x00000000005bd748 in ExecProcNode (node=0xce4808) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execProcnode.c:496
#14 0x00000000005bbfdf in ExecutePlan (queryDesc=0xce93a8, direction=<value
optimized out>, count=0) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execMain.c:1549
#15 standard_ExecutorRun (queryDesc=0xce93a8, direction=<value optimized
out>, count=0) at
/home/test/compile/../postgresql-9.5.10/src/backend/executor/execMain.c:337
#16 0x00000000006c1f57 in PortalRunSelect (portal=0xd2b9c8, forward=<value
optimized out>, count=0, dest=<value optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/tcop/pquery.c:942
#17 0x00000000006c31c7 in PortalRun (portal=0xd2b9c8,
count=9223372036854775807, isTopLevel=1 '\001', dest=0xced478,
altdest=0xced478, completionTag=0x7ffea1ef1440 "") at
/home/test/compile/../postgresql-9.5.10/src/backend/tcop/pquery.c:786
#18 0x00000000006bff50 in exec_simple_query (query_string=0xccd7e8 "select
count(*) from pg_logical_slot_peek_changes('pg_slotnew',null,1000);") at
/home/test/compile/../postgresql-9.5.10/src/backend/tcop/postgres.c:1096
#19 0x00000000006c157c in PostgresMain (argc=<value optimized out>,
argv=<value optimized out>, dbname=0xc36100 "sasdb", username=<value
optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/tcop/postgres.c:4049
#20 0x0000000000665d1c in BackendRun (argc=<value optimized out>,
argv=<value optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/postmaster/postmaster.c:4264
#21 BackendStartup (argc=<value optimized out>, argv=<value optimized out>)
at
/home/test/compile/../postgresql-9.5.10/src/backend/postmaster/postmaster.c:3938
#22 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at
/home/test/compile/../postgresql-9.5.10/src/backend/postmaster/postmaster.c:1703
#23 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
at
/home/test/compile/../postgresql-9.5.10/src/backend/postmaster/postmaster.c:1311
#24 0x00000000005f4309 in main (argc=3, argv=0xc34b60) at
/home/test/compile/../postgresql-9.5.10/src/backend/main/main.c:228


Is this a bug? Is there any way I could avoid it?
If you need any more information for debugging, kindly let me know.

Thanks in advance,


Sudalai






Re: Segmentation Fault in logical decoding get/peek API

From
Sudalai
Date:
Kindly help me resolve the issue.





Re: Segmentation Fault in logical decoding get/peek API

From
Sudalai
Date:
Kindly help me resolve this issue. 





Re: Segmentation Fault in logical decoding get/peek API

From
Andres Freund
Date:
On 2018-02-20 04:00:07 -0700, Sudalai wrote:
> Kindly help me resolve the issue.

Please don't spam the list with daily requests for status
updates. Especially not two within minutes, less than a day after the
original report.

Greetings,

Andres Freund


Re: Segmentation Fault in logical decoding get/peek API

From
Andres Freund
Date:
Hi,

On 2018-02-19 05:20:14 -0700, Sudalai wrote:
> A PostgreSQL backend is crashing with a segmentation fault (signal 11) while
> consuming the changes of a big transaction using
> pg_logical_slot_(peek|get)_changes.  We are using PostgreSQL version 9.5.10.

Just to be sure, could you check that 9.5.11 shows the same problem?

> About the big transaction: pg_xlogdump shows 161426 records, and it is spread
> across 3 WAL files.
> 
> Backtrace of the backend:
> 
> Program received signal SIGSEGV, Segmentation fault.
> ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
> commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
> optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
> /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380
> 1380                        change->action = REORDER_BUFFER_CHANGE_INSERT;
>
> #0  ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
> commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
> optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
> /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380

Unfortunately this does not seem to provide enough information to debug
the problem. Could you develop a reproducer we can run locally?

- Andres


Re: Segmentation Fault in logical decoding get/peek API

From
Tomas Vondra
Date:

On 02/20/2018 06:55 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-02-19 05:20:14 -0700, Sudalai wrote:
>> A PostgreSQL backend is crashing with a segmentation fault (signal 11) while
>> consuming the changes of a big transaction using
>> pg_logical_slot_(peek|get)_changes.  We are using PostgreSQL version 9.5.10.
> 
> Just to be sure, could you check that 9.5.11 shows the same problem?
> 
>> About the big transaction: pg_xlogdump shows 161426 records, and it is spread
>> across 3 WAL files.
>>
>> Backtrace of the backend:
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
>> commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
>> optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
>> /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380

This very much seems like a custom PostgreSQL build (judging by the
paths in the backtrace), so I wonder what configure options were used.
It's not clear if this is a production or testing environment, but if you
could rebuild with debugging symbols / without optimizations, that would
be helpful (it would get rid of some of the "optimized out" stuff).

>> 1380                        change->action = REORDER_BUFFER_CHANGE_INSERT;
>>

Hmmm, this would mean it's failing here:

    case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:

        /*
         * Confirmation for speculative insertion arrived. Simply
         * use as a normal record. It'll be cleaned up at the end
         * of INSERT processing.
         */
        Assert(specinsert->data.tp.oldtuple == NULL);
        change = specinsert;
        change->action = REORDER_BUFFER_CHANGE_INSERT;

And if it's segfaulting, it has to mean specinsert is NULL. So either we
never got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, or we threw it
away in the "change_done" part. Seems strange in both cases.

Sudalai, are you using speculative inserts in the transaction?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Segmentation Fault in logical decoding get/peek API

From
Peter Geoghegan
Date:
On Tue, Feb 20, 2018 at 1:35 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> And if it's segfaulting, it has to mean specinsert is NULL. So either we
> never got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, or we threw it
> away in the "change_done" part. Seems strange in both cases.
>
> Sudalai, are you using speculative inserts in the transaction?

He means ON CONFLICT DO UPDATE.  :-)

The speculative insertion case handling in reorderbuffer.c is likely
something that is hardly ever used, since speculative insertion
conflicts are presumably generally very rare. Though at the same time,
they're surely very common for certain workloads (such as my testing
script for ON CONFLICT DO UPDATE [1]).

[1] https://github.com/petergeoghegan/jjanes_upsert
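To make the terminology concrete, here is a minimal SQL sketch of the kind of
statement that takes the speculative-insertion path (the table and column
names are purely illustrative):

    -- Hypothetical table; any unique or primary-key constraint can serve
    -- as the conflict arbiter.
    CREATE TABLE counters (
        name  text PRIMARY KEY,
        value bigint NOT NULL
    );

    -- Statements like this drive speculative insertions: the new row is
    -- inserted speculatively and then either confirmed or backed out, with
    -- the UPDATE applied instead when a conflict is found.
    INSERT INTO counters (name, value)
    VALUES ('hits', 1)
    ON CONFLICT (name)
    DO UPDATE SET value = counters.value + EXCLUDED.value;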
-- 
Peter Geoghegan


Re: Segmentation Fault in logical decoding get/peek API

From
Sudalai
Date:

>>And if it's segfaulting, it has to mean specinsert is NULL. So either we 
>>never got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, or we threw it 
>>away in the "change_done" part. Seems strange in both cases. 

Yes, specinsert is NULL.  

>> Sudalai, are you using speculative inserts in the transaction? 
Yes.  We have used ON CONFLICT DO NOTHING.


Is it correct to add the check below,

      if (specinsert == NULL)
          goto change_done;

before Assert(specinsert->data.tp.oldtuple == NULL); to fix the segfault?

Is specinsert NULL because of ON CONFLICT DO NOTHING?

Thanks for the response, Tomas and Peter.






Re: Segmentation Fault in logical decoding get/peek API

From
Tomas Vondra
Date:

On 02/21/2018 08:18 AM, Sudalai wrote:
> 
> 
>>> And if it's segfaulting, it has to mean specinsert is NULL. So either we 
>>> never got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, or we threw it 
>>> away in the "change_done" part. Seems strange in both cases. 
> 
> Yes, specinsert is NULL.  
> 
>>> Sudalai, are you using speculative inserts in the transaction? 
> Yes.  We have used ON CONFLICT DO NOTHING.
> 
> 
> Is it correct to add the check below,
>       if (specinsert == NULL)
>           goto change_done;
> before Assert(specinsert->data.tp.oldtuple == NULL); to fix the segfault?
> 
That seems unlikely to fix the root cause - more likely it's going to
mask it in some strange way. We need to understand why it does happen
(i.e. specinsert is not expected to be NULL here).

It would really help if you could provide a self-contained test case
demonstrating the issue, as mentioned by Andres.
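Something along the lines of the following skeleton would be a good starting
point (all names are placeholders; it assumes wal_level = logical, a free
replication slot, and the contrib test_decoding plugin, and on its own it is
not known to trigger the crash):

    -- Skeleton only; placeholder names, not a confirmed reproducer.
    SELECT pg_create_logical_replication_slot('repro_slot', 'test_decoding');

    CREATE TABLE upsert_target (
        id      int PRIMARY KEY,
        payload text
    );

    -- One big transaction full of ON CONFLICT DO NOTHING inserts, similar
    -- to the workload described upthread.
    BEGIN;
    INSERT INTO upsert_target (id, payload)
    SELECT g % 50000, 'row ' || g
    FROM generate_series(1, 200000) AS g
    ON CONFLICT (id) DO NOTHING;
    COMMIT;

    -- Consume the changes the same way the original report does.
    SELECT count(*)
    FROM pg_logical_slot_peek_changes('repro_slot', NULL, 1000);

    SELECT pg_drop_replication_slot('repro_slot');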

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Segmentation Fault in logical decoding get/peek API

From
jfinzel
Date:
We ran into a segmentation fault using peek_changes that appears identical
to what is described in this unfinished thread.  We are running PG 10.3,
and the segfault was fixed by upgrading to 10.6.  However, we could not
find any clearly related fix in the release notes for 10.4 through 10.6.
I did find this commit that I believe fixes the issue:
https://github.com/postgres/postgres/commit/bba8c612117416907f332fce8b9e80b748e0b798

If this indeed fixes a critical issue as we think it does, could someone
please add it in the right place to the release notes?

Would it be useful to anyone for us to provide more diagnostics on how
we encountered this bug?  I have a core dump and stack trace.  We
encountered it while a peek_changes was being run and fetched via a cursor.
The same peek_changes from the same LSN actually succeeds after recovery.
specinsert is NULL here:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000084a914 in ReorderBufferCommit (rb=0x1e236c0, xid=316524727,
commit_lsn=9825880784664, end_lsn=9825880787976,
commit_time=603381553004372, origin_id=0, origin_lsn=0) at
reorderbuffer.c:1395
1395                        Assert(specinsert->data.tp.oldtuple == NULL);

Some interesting notes: we are doing INSERT ... ON CONFLICT DO UPDATE (we are
not using DO NOTHING), and the table of interest has a btree_gist index.
I am certain that the decoding of this SQL is somehow related to the
segfault.
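
As a rough illustration of that combination (hypothetical names and layout,
not our actual schema):

    -- A unique key that INSERT ... ON CONFLICT targets, plus a btree_gist
    -- exclusion constraint on the same table.
    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE booking (
        resource_id int NOT NULL,
        slot_start  timestamptz NOT NULL,
        during      tstzrange NOT NULL,
        PRIMARY KEY (resource_id, slot_start),
        EXCLUDE USING gist (resource_id WITH =, during WITH &&)
    );

    INSERT INTO booking (resource_id, slot_start, during)
    VALUES (1, now(), tstzrange(now(), now() + interval '1 hour'))
    ON CONFLICT (resource_id, slot_start)
    DO UPDATE SET during = EXCLUDED.during;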

Another note of interest is that we have been decoding using pglogical from
this system for some time with no issues, so this is definitely unique to
the use of the peek_changes function.

Please advise, thanks!





Re: Segmentation Fault in logical decoding get/peek API

From
Tom Lane
Date:
jfinzel <finzelj@gmail.com> writes:
> We ran into a segmentation fault using peek_changes that appears identical
> to what is described in this unfinished thread.  We are running PG 10.3,
> and the segfault was fixed by upgrading to 10.6.  However, we could not
> find any clearly related fix in the release notes for 10.4 through 10.6.
> I did find this commit that I believe fixes the issue:
> https://github.com/postgres/postgres/commit/bba8c612117416907f332fce8b9e80b748e0b798

> If this indeed fixes a critical issue as we think it does, could someone
> please add it in the right place to the release notes?

You'd have to provide some evidence that that commit actually had
something to do with a user-visible problem.  It certainly wouldn't,
by itself, have completely fixed any problem: at best it'd have
transformed it from a crash into an elog(ERROR).  More to the point,
that commit was intended to silence a probably-hypothetical-anyway
warning about the prior commit, so it wasn't fixing any issue that
ever saw the light of day in a PG release.

So my bet is that your problem was fixed by some other commit between
10.3 and 10.6.  Maybe the predecessor one, b767b3f2e; but hard to say
without more investigation than seems warranted, if the bug's gone.

            regards, tom lane


Re: Segmentation Fault in logical decoding get/peek API

From
Peter Geoghegan
Date:
On Thu, Feb 14, 2019 at 2:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So my bet is that your problem was fixed by some other commit between
> 10.3 and 10.6.  Maybe the predecessor one, b767b3f2e; but hard to say
> without more investigation than seems warranted, if the bug's gone.

I wonder if jfinzel is in a position to do a git-bisect. That might be
interesting.

-- 
Peter Geoghegan


Re: Segmentation Fault in logical decoding get/peek API

From
Jeremy Finzel
Date:


On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
jfinzel <finzelj@gmail.com> writes:

You'd have to provide some evidence that that commit actually had
something to do with a user-visible problem.  It certainly wouldn't,
by itself, have completely fixed any problem: at best it'd have
transformed it from a crash into an elog(ERROR).  More to the point,
that commit was intended to silence a probably-hypothetical-anyway
warning about the prior commit, so it wasn't fixing any issue that
ever saw the light of day in a PG release.

And indeed I have not seen such error messages so I suspect you are right there.


So my bet is that your problem was fixed by some other commit between
10.3 and 10.6.  Maybe the predecessor one, b767b3f2e; but hard to say
without more investigation than seems warranted, if the bug's gone.

                        regards, tom lane
I am willing to put in more time debugging this because we want to know which clusters are actually susceptible to the bug. Any suggestions to proceed are welcome.

Thanks,
Jeremy 

Re: Segmentation Fault in logical decoding get/peek API

From
Tom Lane
Date:
Jeremy Finzel <finzelj@gmail.com> writes:
> On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So my bet is that your problem was fixed by some other commit between
>> 10.3 and 10.6.  Maybe the predecessor one, b767b3f2e; but hard to say
>> without more investigation than seems warranted, if the bug's gone.

> I am willing to put in more time debugging this because we want to know
> which clusters are actually susceptible to the bug. Any suggestions to
> proceed are welcome.

Well, as Peter said, "git bisect" and trying to reproduce the problem
at each step would be the way to prove it definitively.  Seems mighty
tedious though.  Possibly you could shave some time off the process
by assuming it must have been one of the commits that touched
reorderbuffer.c ... a quick check says there have been ten of those
in the v10 branch since 10.3.

            regards, tom lane


Re: Segmentation Fault in logical decoding get/peek API

From
Jeremy Finzel
Date:
Well, as Peter said, "git bisect" and trying to reproduce the problem
at each step would be the way to prove it definitively.  Seems mighty
tedious though.  Possibly you could shave some time off the process
by assuming it must have been one of the commits that touched
reorderbuffer.c ... a quick check says there have been ten of those
in the v10 branch since 10.3.

Update:
  • I definitely got the same segfault on a commit after 10.4 - 0bb28ca
  • I am now getting a different segfault on 10.5 - but I need another set of eyes to verify I am not compiling it wrong
After decoding successfully for a while, I now get an immediate segfault upon peek_changes.  First of all, here is the backtrace:

$ sudo -u postgres gdb -q -c /san/<cluster>/pgdata/core /usr/lib/postgresql/10.5/bin/postgres
Reading symbols from /usr/lib/postgresql/10.5/bin/postgres...done.
[New LWP 22699]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: <cluster>: jfinzel foo_db 10.7.111.37(52316) FETCH'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007eff42d54428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007eff42d54428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007eff42d5602a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x0000000000a45f9c in ExceptionalCondition (conditionName=0xc2d688 "!(prev_first_lsn < cur_txn->first_lsn)", errorType=0xc2d404 "FailedAssertion", fileName=0xc2d478 "reorderbuffer.c", lineNumber=688) at assert.c:54
#3  0x000000000084b0ac in AssertTXNLsnOrder (rb=0x28ed790) at reorderbuffer.c:688
#4  0x000000000084ab97 in ReorderBufferTXNByXid (rb=0x28ed790, xid=319299822, create=1 '\001', is_new=0x0, lsn=9888781386112, create_as_top=1 '\001') at reorderbuffer.c:567
#5  0x000000000084d86c in ReorderBufferAddNewTupleCids (rb=0x28ed790, xid=319299822, lsn=9888781386112, node=..., tid=..., cmin=2, cmax=4294967295, combocid=4294967295) at reorderbuffer.c:2053
#6  0x00000000008522b6 in SnapBuildProcessNewCid (builder=0x28f57c0, xid=319299827, lsn=9888781386112, xlrec=0x2821c08) at snapbuild.c:780
#7  0x000000000083f280 in DecodeHeap2Op (ctx=0x28dd720, buf=0x7ffc5b73e2d0) at decode.c:371
#8  0x000000000083ebb1 in LogicalDecodingProcessRecord (ctx=0x28dd720, record=0x28dd9e0) at decode.c:121
#9  0x0000000000844f86 in pg_logical_slot_get_changes_guts (fcinfo=0x7ffc5b73e600, confirm=0 '\000', binary=0 '\000') at logicalfuncs.c:308
#10 0x000000000084514d in pg_logical_slot_peek_changes (fcinfo=0x7ffc5b73e600) at logicalfuncs.c:381
#11 0x00000000006f7973 in ExecMakeTableFunctionResult (setexpr=0x28265b8, econtext=0x28262b0, argContext=0x28b4af0, expectedDesc=0x28d1d20, randomAccess=4 '\004') at execSRF.c:231
#12 0x000000000070a870 in FunctionNext (node=0x2826198) at nodeFunctionscan.c:94
#13 0x00000000006f6f6e in ExecScanFetch (node=0x2826198, accessMtd=0x70a7b9 <FunctionNext>, recheckMtd=0x70aba1 <FunctionRecheck>) at execScan.c:97
#14 0x00000000006f6fdd in ExecScan (node=0x2826198, accessMtd=0x70a7b9 <FunctionNext>, recheckMtd=0x70aba1 <FunctionRecheck>) at execScan.c:147
#15 0x000000000070abef in ExecFunctionScan (pstate=0x2826198) at nodeFunctionscan.c:270
#16 0x00000000006f541a in ExecProcNodeFirst (node=0x2826198) at execProcnode.c:430
#17 0x00000000006ed5af in ExecProcNode (node=0x2826198) at ../../../src/include/executor/executor.h:250
#18 0x00000000006effaf in ExecutePlan (estate=0x2825f80, planstate=0x2826198, use_parallel_mode=0 '\000', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=2000, direction=ForwardScanDirection, dest=0x27ffc78, execute_once=0 '\000') at execMain.c:1722
#19 0x00000000006edbc4 in standard_ExecutorRun (queryDesc=0x2825130, direction=ForwardScanDirection, count=2000, execute_once=0 '\000') at execMain.c:363
#20 0x00000000006ed9de in ExecutorRun (queryDesc=0x2825130, direction=ForwardScanDirection, count=2000, execute_once=0 '\000') at execMain.c:306
#21 0x00000000008d0dd7 in PortalRunSelect (portal=0x27f70a8, forward=1 '\001', count=2000, dest=0x27ffc78) at pquery.c:932
#22 0x00000000008d200f in DoPortalRunFetch (portal=0x27f70a8, fdirection=FETCH_FORWARD, count=2000, dest=0x27ffc78) at pquery.c:1675
#23 0x00000000008d19df in PortalRunFetch (portal=0x27f70a8, fdirection=FETCH_FORWARD, count=2000, dest=0x27ffc78) at pquery.c:1434
#24 0x00000000006833bb in PerformPortalFetch (stmt=0x2888570, dest=0x27ffc78, completionTag=0x7ffc5b73f0f0 "") at portalcmds.c:199
#25 0x00000000008d2ab6 in standard_ProcessUtility (pstmt=0x28888d0, queryString=0x2887b30 "FETCH FORWARD 2000 FROM crash_dude;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x27ffc78, completionTag=0x7ffc5b73f0f0 "") at utility.c:527
#26 0x00007eff42829eb6 in pglogical_ProcessUtility (pstmt=0x28888d0, queryString=0x2887b30 "FETCH FORWARD 2000 FROM crash_dude;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x27ffc78, completionTag=0x7ffc5b73f0f0 "") at pglogical_executor.c:279
#27 0x00000000008d2547 in ProcessUtility (pstmt=0x28888d0, queryString=0x2887b30 "FETCH FORWARD 2000 FROM crash_dude;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x27ffc78, completionTag=0x7ffc5b73f0f0 "") at utility.c:353
#28 0x00000000008d141b in PortalRunUtility (portal=0x27f6f90, pstmt=0x28888d0, isTopLevel=1 '\001', setHoldSnapshot=1 '\001', dest=0x27ffc78, completionTag=0x7ffc5b73f0f0 "") at pquery.c:1178
#29 0x00000000008d1119 in FillPortalStore (portal=0x27f6f90, isTopLevel=1 '\001') at pquery.c:1038
#30 0x00000000008d09a1 in PortalRun (portal=0x27f6f90, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x28889c8, altdest=0x28889c8, completionTag=0x7ffc5b73f350 "") at pquery.c:768
#31 0x00000000008c9f67 in exec_simple_query (query_string=0x2887b30 "FETCH FORWARD 2000 FROM crash_dude;") at postgres.c:1099
#32 0x00000000008cea3c in PostgresMain (argc=1, argv=0x2804e50, dbname=0x2804e28 "<foo_db>", username=0x2804e08 "jfinzel") at postgres.c:4088
#33 0x000000000082369b in BackendRun (port=0x2801170) at postmaster.c:4405
#34 0x0000000000822d02 in BackendStartup (port=0x2801170) at postmaster.c:4077
#35 0x000000000081ee31 in ServerLoop () at postmaster.c:1755
#36 0x000000000081e2d9 in PostmasterMain (argc=3, argv=0x27d79a0) at postmaster.c:1363
#37 0x0000000000751669 in main (argc=3, argv=0x27d79a0) at main.c:228

Here is the compile script that I used to build and run 10.5 (at commit 4191e37a9a1fb598267c445c717914012d9bc423).  The cluster with the issue uses the extensions compiled below as well:

$ cat make_postgres
#!/bin/bash

set -eu

dirname=$1

instdir=/usr/lib/postgresql/$dirname

# Install Postgres
export PATH=$instdir/bin:/home/jfinzel/bin:/home/jfinzel/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
sudo mkdir $instdir

# This is my directory with source code from git commit 4191e37a9a1fb598267c445c717914012d9bc423
cd ~/postgres_source/postgres
./configure --prefix=$instdir --enable-cassert --enable-debug CFLAGS="-ggdb -g3 -fno-omit-frame-pointer -fPIC"
make
sudo "PATH=$PATH" make install

# Contrib
cd contrib/btree_gist/
sudo "PATH=$PATH" make install
cd ../test_decoding/
sudo "PATH=$PATH" make install

# Install Pglogical
cd /usr/src/pglogical-2.2.1
sudo "PATH=$PATH" make clean
sudo "PATH=$PATH" make install

# Install Extensions
cd $HOME/pgl_ddl_deploy
make clean
sudo "PATH=$PATH" make install
cd $HOME/pglogical_ticker
make clean
sudo "PATH=$PATH" make install
cd $HOME/pg_fact_loader
make clean
sudo "PATH=$PATH" make install

$ ./make_postgres 10.5

Thanks!
Jeremy

Re: Segmentation Fault in logical decoding get/peek API

From
Andres Freund
Date:
Hi,

>    - I definitely got the same segfault on a commit after 10.4 - 0bb28ca
>    - I am now getting a different segfault on 10.5 - but I need another set
>    of eyes to verify I am not compiling it wrong
> 
> After decoding successfully for a while, I now get an immediate segfault
> upon peek_changes.  First of all, here is the backtrace:

But to confirm, with 10.6 that's gone?


> $ sudo -u postgres gdb -q -c /san/<cluster>/pgdata/core
> /usr/lib/postgresql/10.5/bin/postgres
> Reading symbols from /usr/lib/postgresql/10.5/bin/postgres...done.
> [New LWP 22699]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `postgres: <cluster>: jfinzel foo_db
> 10.7.111.37(52316) FETCH'.
> Program terminated with signal SIGABRT, Aborted.

Note that this is not a segfault, it's an assertion that failed.


> #0  0x00007eff42d54428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> (gdb) bt
> #0  0x00007eff42d54428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007eff42d5602a in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x0000000000a45f9c in ExceptionalCondition (conditionName=0xc2d688
> "!(prev_first_lsn < cur_txn->first_lsn)", errorType=0xc2d404
> "FailedAssertion", fileName=0xc2d478 "reorderbuffer.c", lineNumber=688) at
> assert.c:54

If it's not gone with 10.6, could you print the two variables in that
expression?



> # Install Extensions
> cd $HOME/pgl_ddl_deploy
> make clean
> sudo "PATH=$PATH" make install
> cd $HOME/pglogical_ticker
> make clean
> sudo "PATH=$PATH" make install
> cd $HOME/pg_fact_loader
> make clean
> sudo "PATH=$PATH" make install

It's entirely possible that one of those extensions does something to
violate WAL logging rules :/

Greetings,

Andres Freund


Re: Segmentation Fault in logical decoding get/peek API

From
Andres Freund
Date:
On 2019-02-18 12:15:47 -0800, Andres Freund wrote:
> But to confirm, with 10.6 that's gone?

*10.7


Re: Segmentation Fault in logical decoding get/peek API

From
Jeremy Finzel
Date:
But to confirm, with 10.6 that's gone?

We now have production running on 10.6 (including all those extensions), and have seen no more issues at all.  So I am assuming it must be fixed in 10.6 at this point.
 
If it's not gone with 10.6, could you print the two variables in that
expression?

Well, it is gone with 10.6 I think, but here is what you requested from my 10.5 build:
(gdb) frame 2
#2  0x0000000000a45f9c in ExceptionalCondition (conditionName=0xc2d688 "!(prev_first_lsn < cur_txn->first_lsn)", errorType=0xc2d404 "FailedAssertion", fileName=0xc2d478 "reorderbuffer.c", lineNumber=688) at assert.c:54
54 abort();
(gdb) up
#3  0x000000000084b0ac in AssertTXNLsnOrder (rb=0x2775790) at reorderbuffer.c:688
688 Assert(prev_first_lsn < cur_txn->first_lsn);
(gdb) print prev_first_lsn
$2 = 9888781386112
(gdb) print cur_txn->first_lsn
$3 = 9888781386112
 
> # Install Extensions
> cd $HOME/pgl_ddl_deploy
> make clean
> sudo "PATH=$PATH" make install
> cd $HOME/pglogical_ticker
> make clean
> sudo "PATH=$PATH" make install
> cd $HOME/pg_fact_loader
> make clean
> sudo "PATH=$PATH" make install

It's entirely possible that one of those extensions does something to
violate WAL logging rules :/

But then I would expect to still have a problem in our production instance running 10.6, which I don't.

Thanks,
Jeremy 

Re: Segmentation Fault in logical decoding get/peek API

From
Jeremy Finzel
Date:
We now have production running on 10.6 (including all those extensions), and have seen no more issues at all.  So I am assuming it must be fixed in 10.6 at this point.

Somehow I am getting this crash on my compiled 10.6, but I don't get it on packaged 10.6. :(

I've tried this a number of times, so I'm not sure how worthwhile it is to try to get to the bottom of the compile difference.  But I would like to understand how the compiled version is different.  Really, the only difference should be the compile options used for postgres.

Thanks,
Jeremy

Re: Segmentation Fault in logical decoding get/peek API

From
Alvaro Herrera
Date:
On 2019-Feb-18, Jeremy Finzel wrote:

> >
> > We now have production running on 10.6 (including all those extensions),
> > and have seen no more issues at all.  So I am assuming it must be fixed in
> > 10.6 at this point.
> 
> Somehow I am getting this crash on my compiled 10.6, but I don't get it on
> packaged 10.6. :(

Maybe because the packaged version doesn't have assertions enabled?
(configure --enable-cassert)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Segmentation Fault in logical decoding get/peek API

From
Jeremy Finzel
Date:
Maybe because the packaged version doesn't have assertions enabled?
(configure --enable-cassert)

Of course!  I am getting a failed assertion on 10.6 then.  Is that a concern?  Should I try 10.7?

Thanks,
Jeremy 

Re: Segmentation Fault in logical decoding get/peek API

From
Andres Freund
Date:
Hi,

On 2019-02-18 15:48:05 -0600, Jeremy Finzel wrote:
> >
> > Maybe because the packaged version doesn't have assertions enabled?
> > (configure --enable-cassert)
> >
> 
> Of course!  I am getting a failed assertion on 10.6 then.  Is that a
> concern?  Should I try 10.7?

Yes, please do. And if that fails, is it realistic to test without those
additional extensions?  If not, where's their source?

Greetings,

Andres Freund


Re: Segmentation Fault in logical decoding get/peek API

From
Jeremy Finzel
Date:
Yes, please do. And if that fails, is it realistic to test without those
additional extensions?  If not, where's their source?

So yes, I just reproduced the same assertion failure on 10.7, so the issue is still there.

I still need pg_fact_loader, but I was able to strip out any custom C code from the test and I am still able to produce the crash.  I dropped extensions pgl_ddl_deploy and pglogical_ticker.  pg_fact_loader can optionally use a pg background worker, but I am bypassing that and thus only running plpgsql functions, and still producing the crash.  This makes me doubt that it could have anything to do with these extensions.

If interested, here is the source for them:

I want to see if I can reproduce the crash on a fresh cluster.  Is there anything else you want me to try at this point?

Thanks,
Jeremy

Re: Segmentation Fault in logical decoding get/peek API

From
Michael Paquier
Date:
On Tue, Feb 19, 2019 at 09:32:15AM -0600, Jeremy Finzel wrote:
> I still need pg_fact_loader, but I was able to strip out any custom C code
> from the test and I am still able to produce the crash.  I dropped
> extensions pgl_ddl_deploy and pglogical_ticker.  pg_fact_loader can
> optionally use a pg background worker, but I am bypassing that and thus
> only running plpgsql functions, and still producing the crash.  This makes
> me doubt that it could have anything to do with these extensions.

The best thing that could be done is to post a set of SQL queries
on this list that allows the problem to be reproduced.
--
Michael


Re: Segmentation Fault in logical decoding get/peek API

From
Jeremy Finzel
Date:
The best thing that could be done is to post a set of SQL queries
on this list that allows the problem to be reproduced.

I am trying, but I am not sure how to do that just yet.  However, I can reliably reproduce the error by letting this system run, meaning I have the ability to attach a debugger to see what is happening more deeply.  I can also provide just what data in the slot is leading to a failed assert.  Would any of this help to narrow down the problem?
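
For reference, here is a sketch of how the slot can be examined without
consuming it (the slot name is a placeholder):

    -- Where the slot currently stands.
    SELECT slot_name, plugin, restart_lsn, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE slot_name = 'my_slot';

    -- Peek at a bounded number of changes; peeking does not advance the slot.
    SELECT lsn, xid, data
    FROM pg_logical_slot_peek_changes('my_slot', NULL, 10);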

Thanks,
Jeremy

Re: Segmentation Fault in logical decoding get/peek API

From
Jeremy Finzel
Date:


On Thu, Feb 14, 2019 at 5:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeremy Finzel <finzelj@gmail.com> writes:
> On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So my bet is that your problem was fixed by some other commit between
>> 10.3 and 10.6.  Maybe the predecessor one, b767b3f2e; but hard to say
>> without more investigation than seems warranted, if the bug's gone.

> I am willing to put in more time debugging this because we want to know
> which clusters are actually susceptible to the bug. Any suggestions to
> proceed are welcome.

Well, as Peter said, "git bisect" and trying to reproduce the problem
at each step would be the way to prove it definitively.  Seems mighty
tedious though.  Possibly you could shave some time off the process
by assuming it must have been one of the commits that touched
reorderbuffer.c ... a quick check says there have been ten of those
in the v10 branch since 10.3.

                        regards, tom lane

I am slowly continuing to pursue this, and I hit a slightly different actual segfault on commit 4346794, still in reorderbuffer.c, this time just two lines below the original.

The first segfault I was getting was on:
 Assert(specinsert->data.tp.oldtuple == NULL);

Now it's on this:
change->action = REORDER_BUFFER_CHANGE_INSERT

I only have 3 more steps to test, and then I should know exactly which commit removed the possibility of the segfault.  I still don't know precisely what reproduces the crash.

Thanks,
Jeremy

Re: Segmentation Fault in logical decoding get/peek API

From
Jeremy Finzel
Date:
On Wed, Feb 27, 2019 at 8:34 AM Jeremy Finzel <finzelj@gmail.com> wrote:
On Thu, Feb 14, 2019 at 5:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeremy Finzel <finzelj@gmail.com> writes:
> On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So my bet is that your problem was fixed by some other commit between
>> 10.3 and 10.6.  Maybe the predecessor one, b767b3f2e; but hard to say
>> without more investigation than seems warranted, if the bug's gone.

> I am willing to put in more time debugging this because we want to know
> which clusters are actually susceptible to the bug. Any suggestions to
> proceed are welcome.

Well, as Peter said, "git bisect" and trying to reproduce the problem
at each step would be the way to prove it definitively.  Seems mighty
tedious though.  Possibly you could shave some time off the process
by assuming it must have been one of the commits that touched
reorderbuffer.c ... a quick check says there have been ten of those
in the v10 branch since 10.3.

                        regards, tom lane

I have confirmed which commit fixed the segfault:

cee1dd1eeda1e7b86b78f240d24bbfde21d75928 is the first fixed commit
commit cee1dd1eeda1e7b86b78f240d24bbfde21d75928
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   Tue Mar 6 15:57:20 2018 -0300

    Refrain from duplicating data in reorderbuffers

    If a walsender exits leaving data in reorderbuffers, the next walsender
    that tries to decode the same transaction would append its decoded data
    in the same spill files without truncating it first, which effectively
    duplicate the data.  Avoid that by removing any leftover reorderbuffer
    spill files when a walsender starts.

    Backpatch to 9.4; this bug has been there from the very beginning of
    logical decoding.

    Author: Craig Ringer, revised by me
    Reviewed by: Álvaro Herrera, Petr Jelínek, Masahiko Sawada 


I'm very curious whether anyone more familiar with the code base understands how the old code was vulnerable to a segfault but the new code is not.  I don't think a walsender was exiting at all in my situation, so I believe this change incidentally fixed a potential segfault unrelated to its stated purpose of cleaning up spill files left behind by an exiting walsender.

I still can't explain precisely what reproduces it.  As I have said before, some notable things about my setup are lots of INSERT ... ON CONFLICT DO UPDATE queries, as well as btree_gist exclusion indexes; I suspect this has to do with the former somehow, although I haven't been able to create a reliable reproducer.

I suggest at least adding something about this to the 10.4 release notes (and to the corresponding releases on the other branches).

Here is the release note now:
  • In logical decoding, avoid possible double processing of WAL data when a walsender restarts (Craig Ringer)
Perhaps an addition like this:
  • In logical decoding, avoid possible double processing of WAL data when a walsender restarts (Craig Ringer)
    This change also had the effect of fixing a potential segfault in the logical decoding peek/get_changes API, in ReorderBufferCommit.

Thanks,
Jeremy