Thread: Segmentation Fault in logical decoding get/peek API
Hi everyone,

A PostgreSQL backend is crashing with a segmentation fault (signal 11) while consuming the changes of a big transaction using pg_logical_slot_(peek|get)_changes. We are using PostgreSQL version 9.5.10.

About the big transaction: pg_xlogdump shows 161426 records, spread across 3 WAL files.

Backtrace of the backend:

Program received signal SIGSEGV, Segmentation fault.
ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>, commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value optimized out>, origin_id=<value optimized out>, origin_lsn=0) at /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380
1380            change->action = REORDER_BUFFER_CHANGE_INSERT;

#0  ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>, commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value optimized out>, origin_id=<value optimized out>, origin_lsn=0) at /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380
#1  0x0000000000675cd1 in DecodeCommit (ctx=0xcffdb8, record=<value optimized out>) at /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/decode.c:549
#2  DecodeXactOp (ctx=0xcffdb8, record=<value optimized out>) at /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/decode.c:234
#3  LogicalDecodingProcessRecord (ctx=0xcffdb8, record=<value optimized out>) at /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/decode.c:111
#4  0x00000000006772e5 in pg_logical_slot_get_changes_guts (fcinfo=0x7ffea1ef0b90, confirm=0 '\000', binary=0 '\000') at /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/logicalfuncs.c:442
#5  0x00000000005c4074 in ExecMakeTableFunctionResult (funcexpr=0xce55a0, econtext=0xce5190, argContext=<value optimized out>, expectedDesc=0xce6360, randomAccess=0 '\000') at /home/test/compile/../postgresql-9.5.10/src/backend/executor/execQual.c:2189
#6  0x00000000005d8652 in FunctionNext (node=0xce5080) at /home/test/compile/../postgresql-9.5.10/src/backend/executor/nodeFunctionscan.c:95
#7  0x00000000005c4d73 in ExecScanFetch (node=0xce5080, accessMtd=0x5d83a0 <FunctionNext>, recheckMtd=0x5d7cd0 <FunctionRecheck>) at /home/test/compile/../postgresql-9.5.10/src/backend/executor/execScan.c:95
#8  ExecScan (node=0xce5080, accessMtd=0x5d83a0 <FunctionNext>, recheckMtd=0x5d7cd0 <FunctionRecheck>) at /home/test/compile/../postgresql-9.5.10/src/backend/executor/execScan.c:145
#9  0x00000000005bd808 in ExecProcNode (node=0xce5080) at /home/test/compile/../postgresql-9.5.10/src/backend/executor/execProcnode.c:442
#10 0x00000000005caca6 in fetch_input_tuple (aggstate=0xce4808) at /home/test/compile/../postgresql-9.5.10/src/backend/executor/nodeAgg.c:487
#11 0x00000000005cc286 in agg_retrieve_direct (node=<value optimized out>) at /home/test/compile/../postgresql-9.5.10/src/backend/executor/nodeAgg.c:1683
#12 ExecAgg (node=<value optimized out>) at /home/test/compile/../postgresql-9.5.10/src/backend/executor/nodeAgg.c:1508
#13 0x00000000005bd748 in ExecProcNode (node=0xce4808) at /home/test/compile/../postgresql-9.5.10/src/backend/executor/execProcnode.c:496
#14 0x00000000005bbfdf in ExecutePlan (queryDesc=0xce93a8, direction=<value optimized out>, count=0) at /home/test/compile/../postgresql-9.5.10/src/backend/executor/execMain.c:1549
#15 standard_ExecutorRun (queryDesc=0xce93a8, direction=<value optimized out>, count=0) at /home/test/compile/../postgresql-9.5.10/src/backend/executor/execMain.c:337
#16 0x00000000006c1f57 in PortalRunSelect (portal=0xd2b9c8, forward=<value optimized out>, count=0, dest=<value optimized out>) at /home/test/compile/../postgresql-9.5.10/src/backend/tcop/pquery.c:942
#17 0x00000000006c31c7 in PortalRun (portal=0xd2b9c8, count=9223372036854775807, isTopLevel=1 '\001', dest=0xced478, altdest=0xced478, completionTag=0x7ffea1ef1440 "") at /home/test/compile/../postgresql-9.5.10/src/backend/tcop/pquery.c:786
#18 0x00000000006bff50 in exec_simple_query (query_string=0xccd7e8 "select count(*) from pg_logical_slot_peek_changes('pg_slotnew',null,1000);") at /home/test/compile/../postgresql-9.5.10/src/backend/tcop/postgres.c:1096
#19 0x00000000006c157c in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0xc36100 "sasdb", username=<value optimized out>) at /home/test/compile/../postgresql-9.5.10/src/backend/tcop/postgres.c:4049
#20 0x0000000000665d1c in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at /home/test/compile/../postgresql-9.5.10/src/backend/postmaster/postmaster.c:4264
#21 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at /home/test/compile/../postgresql-9.5.10/src/backend/postmaster/postmaster.c:3938
#22 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at /home/test/compile/../postgresql-9.5.10/src/backend/postmaster/postmaster.c:1703
#23 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at /home/test/compile/../postgresql-9.5.10/src/backend/postmaster/postmaster.c:1311
#24 0x00000000005f4309 in main (argc=3, argv=0xc34b60) at /home/test/compile/../postgresql-9.5.10/src/backend/main/main.c:228

Is it a bug? Is there any way I could avoid this? If you need any more information for debugging, kindly let me know.

Thanks in advance,
Sudalai
Kindly help me resolve the issue.

sudalai
Kindly help me resolve this issue.

sudalai
On 2018-02-20 04:00:07 -0700, Sudalai wrote:
> Kindly help me resolve the issue.

Please don't spam the list with daily requests for status updates. Especially not two within minutes, less than a day after the original report.

Greetings,

Andres Freund
Hi,

On 2018-02-19 05:20:14 -0700, Sudalai wrote:
> PostgreSQL backend crashing with segmentation fault (signal 11), while consuming big
> transaction changes using pg_logical_slot_(peek|get)_changes. We are using
> postgresql version 9.5.10.

Just to be sure, could you check that 9.5.11 shows the same problem?

> About the big transaction : pg_xlogdump shows 161426 records, spread
> across 3 WAL files.
>
> Backtrace of the backend :
>
> Program received signal SIGSEGV, Segmentation fault.
> ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
> commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
> optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
> /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380
> 1380            change->action = REORDER_BUFFER_CHANGE_INSERT;
>
> #0  ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
> commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
> optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
> /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380

Unfortunately this does not seem to provide enough information to debug the problem. Could you develop a reproducer we can run locally?

- Andres
On 02/20/2018 06:55 PM, Andres Freund wrote:
> Hi,
>
> On 2018-02-19 05:20:14 -0700, Sudalai wrote:
>> PostgreSQL backend crashing with segmentation fault (signal 11), while consuming big
>> transaction changes using pg_logical_slot_(peek|get)_changes. We are using
>> postgresql version 9.5.10.
>
> Just to be sure, could you check that 9.5.11 shows the same problem?
>
>> About the big transaction : pg_xlogdump shows 161426 records, spread
>> across 3 WAL files.
>>
>> Backtrace of the backend :
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> ReorderBufferCommit (rb=0xd17dc8, xid=<value optimized out>,
>> commit_lsn=7224098593984, end_lsn=<value optimized out>, commit_time=<value
>> optimized out>, origin_id=<value optimized out>, origin_lsn=0) at
>> /home/test/compile/../postgresql-9.5.10/src/backend/replication/logical/reorderbuffer.c:1380

This very much seems like a custom PostgreSQL build (judging by the paths in the backtrace), so I wonder what configure options were used.

It's not clear if this is a production or testing environment, but if you could rebuild with debugging symbols / without optimizations, that would be helpful (it would get rid of some of the "optimized out" stuff).

>> 1380            change->action = REORDER_BUFFER_CHANGE_INSERT;

Hmmm, this would mean it's failing here:

            case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:

                /*
                 * Confirmation for speculative insertion arrived. Simply
                 * use as a normal record. It'll be cleaned up at the end
                 * of INSERT processing.
                 */
                Assert(specinsert->data.tp.oldtuple == NULL);
                change = specinsert;
                change->action = REORDER_BUFFER_CHANGE_INSERT;

And if it's segfaulting, it has to mean specinsert is NULL. So either we never got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, or we threw it away in the "change_done" part. Seems strange in both cases.

Sudalai, are you using speculative inserts in the transaction?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 20, 2018 at 1:35 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> And if it's segfaulting, it has to mean specinsert is NULL. So either we
> never got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, or we threw it
> away in the "change_done" part. Seems strange in both cases.
>
> Sudalai, are you using speculative inserts in the transaction?

He means ON CONFLICT DO UPDATE. :-)

The speculative insertion case handling in reorderbuffer.c is likely something that is hardly ever used, since speculative insertion conflicts are presumably generally very rare. Though at the same time, they're surely very common for certain workloads (such as my testing script for ON CONFLICT DO UPDATE [1]).

[1] https://github.com/petergeoghegan/jjanes_upsert

--
Peter Geoghegan
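For illustration, a minimal upsert of the kind being discussed might look like this (table and column names are hypothetical, not taken from the report):

    CREATE TABLE upsert_target (id int PRIMARY KEY, val text);

    -- no conflict: the row is inserted speculatively and then confirmed,
    -- which is what the REORDER_BUFFER_CHANGE_INTERNAL_SPEC_* changes in
    -- reorderbuffer.c correspond to
    INSERT INTO upsert_target (id, val) VALUES (1, 'first')
    ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;

    -- conflict on the primary key: the existing row is updated instead
    INSERT INTO upsert_target (id, val) VALUES (1, 'second')
    ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;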
>> And if it's segfaulting, it has to mean specinsert is NULL. So either we
>> never got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, or we threw it
>> away in the "change_done" part. Seems strange in both cases.

Yes, specinsert is NULL.

>> Sudalai, are you using speculative inserts in the transaction?

Yes. We have used ON CONFLICT DO NOTHING.

Is it correct to add the check below,

    if (specinsert == NULL)
        goto change_done;

before Assert(specinsert->data.tp.oldtuple == NULL); to fix the segfault?

Is specinsert NULL because of ON CONFLICT DO NOTHING?

Thanks for the response, Tomas and Peter.

sudalai
On 02/21/2018 08:18 AM, Sudalai wrote:
>>> And if it's segfaulting, it has to mean specinsert is NULL. So either we
>>> never got REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT, or we threw it
>>> away in the "change_done" part. Seems strange in both cases.
>
> Yes, specinsert is NULL.
>
>>> Sudalai, are you using speculative inserts in the transaction?
>
> Yes. We have used ON CONFLICT DO NOTHING.
>
> Is it correct to add the check below,
>
>     if (specinsert == NULL)
>         goto change_done;
>
> before Assert(specinsert->data.tp.oldtuple == NULL); to fix the segfault?

That seems unlikely to fix the root cause - more likely it's going to mask it in some strange way. We need to understand why it does happen (i.e. specinsert is not expected to be NULL here).

It would really help if you could provide a self-contained test case demonstrating the issue, as mentioned by Andres.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
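A self-contained attempt at such a test case could look roughly like the sketch below, using the in-tree test_decoding plugin (slot name, table, and row counts are made up here, and it is not guaranteed to trigger the crash; it requires wal_level = logical and a free replication slot):

    CREATE TABLE t (id int PRIMARY KEY, val text);

    -- seed some rows so that roughly half of the upserts below conflict
    INSERT INTO t SELECT g, 'seed' FROM generate_series(1, 100000) g;

    SELECT pg_create_logical_replication_slot('bug_slot', 'test_decoding');

    -- one big transaction of INSERT ... ON CONFLICT; the non-conflicting
    -- rows go through the speculative-insertion path
    BEGIN;
    INSERT INTO t (id, val)
    SELECT g, 'x' FROM generate_series(50001, 250000) g
    ON CONFLICT (id) DO NOTHING;
    COMMIT;

    -- consume the changes the same way the original report does
    SELECT count(*) FROM pg_logical_slot_peek_changes('bug_slot', NULL, 1000);

    -- clean up
    SELECT pg_drop_replication_slot('bug_slot');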
We ran into a segmentation fault using peek_changes that appears identical to what is described in this unfinished thread. We are running pg 10.3, and the segfault was fixed for us by upgrading to 10.6. However, we could not find any clearly related fix in the release notes for 10.4 through 10.6. I did find this commit that I believe fixes the issue:

https://github.com/postgres/postgres/commit/bba8c612117416907f332fce8b9e80b748e0b798

If this indeed fixes a critical issue as we think it does, could someone please add it in the right place to the release notes?

Would it be useful to anyone for us to provide any more diagnostics for how we encountered this bug? I have a core dump and stack trace. We encountered it during a peek_changes being run and fetched via a cursor. The same peek_changes from the same LSN actually succeeds after recovery.

specinsert is null here:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000084a914 in ReorderBufferCommit (rb=0x1e236c0, xid=316524727, commit_lsn=9825880784664, end_lsn=9825880787976, commit_time=603381553004372, origin_id=0, origin_lsn=0) at reorderbuffer.c:1395
1395            Assert(specinsert->data.tp.oldtuple == NULL);

Some interesting notes: we are doing INSERT ... ON CONFLICT UPDATE (we are not using DO NOTHING), and the table of interest has a btree_gist index. I am certain that the decoding of this SQL is somehow related to this segfault. Another note of interest is that we have been decoding using pglogical from this system for some time with no issues, so this is definitely unique to the use of the peek_changes function.

Please advise, thanks!
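For context, "fetched via a cursor" here means roughly the following pattern (cursor and slot names are hypothetical, not the ones from our system):

    BEGIN;
    DECLARE peek_cur CURSOR FOR
        SELECT lsn, xid, data
        FROM pg_logical_slot_peek_changes('my_slot', NULL, NULL);
    -- consume the result in batches; this is what shows up as
    -- "FETCH FORWARD 2000" in the backtraces later in the thread
    FETCH FORWARD 2000 FROM peek_cur;
    FETCH FORWARD 2000 FROM peek_cur;
    COMMIT;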
jfinzel <finzelj@gmail.com> writes:
> We ran into a segmentation fault using peek_changes that appears identical
> to what is described in this unfinished thread. We are running pg 10.3,
> and the segfault was fixed for us by upgrading to 10.6. However, we could
> not find any clearly related fix in the release notes for 10.4 through
> 10.6. I did find this commit that I believe fixes the issue:
> https://github.com/postgres/postgres/commit/bba8c612117416907f332fce8b9e80b748e0b798

> If this indeed fixes a critical issue as we think it does, could someone
> please add it in the right place to the release notes?

You'd have to provide some evidence that that commit actually had something to do with a user-visible problem. It certainly wouldn't, by itself, have completely fixed any problem: at best it'd have transformed it from a crash into an elog(ERROR). More to the point, that commit was intended to silence a probably-hypothetical-anyway warning about the prior commit, so it wasn't fixing any issue that ever saw the light of day in a PG release.

So my bet is that your problem was fixed by some other commit between 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say without more investigation than seems warranted, if the bug's gone.

			regards, tom lane
On Thu, Feb 14, 2019 at 2:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So my bet is that your problem was fixed by some other commit between
> 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
> without more investigation than seems warranted, if the bug's gone.

I wonder if jfinzel is in a position to do a git-bisect. That might be interesting.

--
Peter Geoghegan
On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> jfinzel <finzelj@gmail.com> writes:
>
> You'd have to provide some evidence that that commit actually had
> something to do with a user-visible problem. It certainly wouldn't,
> by itself, have completely fixed any problem: at best it'd have
> transformed it from a crash into an elog(ERROR). More to the point,
> that commit was intended to silence a probably-hypothetical-anyway
> warning about the prior commit, so it wasn't fixing any issue that
> ever saw the light of day in a PG release.

And indeed I have not seen such error messages, so I suspect you are right there.

> So my bet is that your problem was fixed by some other commit between
> 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
> without more investigation than seems warranted, if the bug's gone.
>
> 			regards, tom lane
I am willing to put in more time debugging this because we want to know which clusters are actually susceptible to the bug. Any suggestions to proceed are welcome.
Thanks,
Jeremy
Jeremy Finzel <finzelj@gmail.com> writes:
> On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So my bet is that your problem was fixed by some other commit between
>> 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
>> without more investigation than seems warranted, if the bug's gone.

> I am willing to put in more time debugging this because we want to know
> which clusters are actually susceptible to the bug. Any suggestions to
> proceed are welcome.

Well, as Peter said, "git bisect" and trying to reproduce the problem at each step would be the way to prove it definitively. Seems mighty tedious though. Possibly you could shave some time off the process by assuming it must have been one of the commits that touched reorderbuffer.c ... a quick check says there have been ten of those in the v10 branch since 10.3.

			regards, tom lane
> Well, as Peter said, "git bisect" and trying to reproduce the problem
> at each step would be the way to prove it definitively. Seems mighty
> tedious though. Possibly you could shave some time off the process
> by assuming it must have been one of the commits that touched
> reorderbuffer.c ... a quick check says there have been ten of those
> in the v10 branch since 10.3.
Update:
- I definitely got the same segfault on a commit after 10.4 - 0bb28ca
- I am now getting a different segfault on 10.5 - but I need another set of eyes to verify I am not compiling it wrong
After decoding successfully for a while, I now get an immediate segfault upon peek_changes. First of all, here is the backtrace:
$ sudo -u postgres gdb -q -c /san/<cluster>/pgdata/core /usr/lib/postgresql/10.5/bin/postgres
Reading symbols from /usr/lib/postgresql/10.5/bin/postgres...done.
[New LWP 22699]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: <cluster>: jfinzel foo_db 10.7.111.37(52316) FETCH'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007eff42d54428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007eff42d54428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007eff42d5602a in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x0000000000a45f9c in ExceptionalCondition (conditionName=0xc2d688 "!(prev_first_lsn < cur_txn->first_lsn)", errorType=0xc2d404 "FailedAssertion", fileName=0xc2d478 "reorderbuffer.c", lineNumber=688) at assert.c:54
#3 0x000000000084b0ac in AssertTXNLsnOrder (rb=0x28ed790) at reorderbuffer.c:688
#4 0x000000000084ab97 in ReorderBufferTXNByXid (rb=0x28ed790, xid=319299822, create=1 '\001', is_new=0x0, lsn=9888781386112, create_as_top=1 '\001') at reorderbuffer.c:567
#5 0x000000000084d86c in ReorderBufferAddNewTupleCids (rb=0x28ed790, xid=319299822, lsn=9888781386112, node=..., tid=..., cmin=2, cmax=4294967295, combocid=4294967295) at reorderbuffer.c:2053
#6 0x00000000008522b6 in SnapBuildProcessNewCid (builder=0x28f57c0, xid=319299827, lsn=9888781386112, xlrec=0x2821c08) at snapbuild.c:780
#7 0x000000000083f280 in DecodeHeap2Op (ctx=0x28dd720, buf=0x7ffc5b73e2d0) at decode.c:371
#8 0x000000000083ebb1 in LogicalDecodingProcessRecord (ctx=0x28dd720, record=0x28dd9e0) at decode.c:121
#9 0x0000000000844f86 in pg_logical_slot_get_changes_guts (fcinfo=0x7ffc5b73e600, confirm=0 '\000', binary=0 '\000') at logicalfuncs.c:308
#10 0x000000000084514d in pg_logical_slot_peek_changes (fcinfo=0x7ffc5b73e600) at logicalfuncs.c:381
#11 0x00000000006f7973 in ExecMakeTableFunctionResult (setexpr=0x28265b8, econtext=0x28262b0, argContext=0x28b4af0, expectedDesc=0x28d1d20, randomAccess=4 '\004') at execSRF.c:231
#12 0x000000000070a870 in FunctionNext (node=0x2826198) at nodeFunctionscan.c:94
#13 0x00000000006f6f6e in ExecScanFetch (node=0x2826198, accessMtd=0x70a7b9 <FunctionNext>, recheckMtd=0x70aba1 <FunctionRecheck>) at execScan.c:97
#14 0x00000000006f6fdd in ExecScan (node=0x2826198, accessMtd=0x70a7b9 <FunctionNext>, recheckMtd=0x70aba1 <FunctionRecheck>) at execScan.c:147
#15 0x000000000070abef in ExecFunctionScan (pstate=0x2826198) at nodeFunctionscan.c:270
#16 0x00000000006f541a in ExecProcNodeFirst (node=0x2826198) at execProcnode.c:430
#17 0x00000000006ed5af in ExecProcNode (node=0x2826198) at ../../../src/include/executor/executor.h:250
#18 0x00000000006effaf in ExecutePlan (estate=0x2825f80, planstate=0x2826198, use_parallel_mode=0 '\000', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=2000, direction=ForwardScanDirection, dest=0x27ffc78, execute_once=0 '\000') at execMain.c:1722
#19 0x00000000006edbc4 in standard_ExecutorRun (queryDesc=0x2825130, direction=ForwardScanDirection, count=2000, execute_once=0 '\000') at execMain.c:363
#20 0x00000000006ed9de in ExecutorRun (queryDesc=0x2825130, direction=ForwardScanDirection, count=2000, execute_once=0 '\000') at execMain.c:306
#21 0x00000000008d0dd7 in PortalRunSelect (portal=0x27f70a8, forward=1 '\001', count=2000, dest=0x27ffc78) at pquery.c:932
#22 0x00000000008d200f in DoPortalRunFetch (portal=0x27f70a8, fdirection=FETCH_FORWARD, count=2000, dest=0x27ffc78) at pquery.c:1675
#23 0x00000000008d19df in PortalRunFetch (portal=0x27f70a8, fdirection=FETCH_FORWARD, count=2000, dest=0x27ffc78) at pquery.c:1434
#24 0x00000000006833bb in PerformPortalFetch (stmt=0x2888570, dest=0x27ffc78, completionTag=0x7ffc5b73f0f0 "") at portalcmds.c:199
#25 0x00000000008d2ab6 in standard_ProcessUtility (pstmt=0x28888d0, queryString=0x2887b30 "FETCH FORWARD 2000 FROM crash_dude;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x27ffc78, completionTag=0x7ffc5b73f0f0 "") at utility.c:527
#26 0x00007eff42829eb6 in pglogical_ProcessUtility (pstmt=0x28888d0, queryString=0x2887b30 "FETCH FORWARD 2000 FROM crash_dude;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x27ffc78, completionTag=0x7ffc5b73f0f0 "") at pglogical_executor.c:279
#27 0x00000000008d2547 in ProcessUtility (pstmt=0x28888d0, queryString=0x2887b30 "FETCH FORWARD 2000 FROM crash_dude;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x27ffc78, completionTag=0x7ffc5b73f0f0 "") at utility.c:353
#28 0x00000000008d141b in PortalRunUtility (portal=0x27f6f90, pstmt=0x28888d0, isTopLevel=1 '\001', setHoldSnapshot=1 '\001', dest=0x27ffc78, completionTag=0x7ffc5b73f0f0 "") at pquery.c:1178
#29 0x00000000008d1119 in FillPortalStore (portal=0x27f6f90, isTopLevel=1 '\001') at pquery.c:1038
#30 0x00000000008d09a1 in PortalRun (portal=0x27f6f90, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x28889c8, altdest=0x28889c8, completionTag=0x7ffc5b73f350 "") at pquery.c:768
#31 0x00000000008c9f67 in exec_simple_query (query_string=0x2887b30 "FETCH FORWARD 2000 FROM crash_dude;") at postgres.c:1099
#32 0x00000000008cea3c in PostgresMain (argc=1, argv=0x2804e50, dbname=0x2804e28 "<foo_db>", username=0x2804e08 "jfinzel") at postgres.c:4088
#33 0x000000000082369b in BackendRun (port=0x2801170) at postmaster.c:4405
#34 0x0000000000822d02 in BackendStartup (port=0x2801170) at postmaster.c:4077
#35 0x000000000081ee31 in ServerLoop () at postmaster.c:1755
#36 0x000000000081e2d9 in PostmasterMain (argc=3, argv=0x27d79a0) at postmaster.c:1363
#37 0x0000000000751669 in main (argc=3, argv=0x27d79a0) at main.c:228
Here is the compile script that I used to build and run 10.5 (at commit 4191e37a9a1fb598267c445c717914012d9bc423). The cluster with said issue uses the extensions compiled below as well:
$ cat make_postgres
#!/bin/bash
set -eu
dirname=$1
instdir=/usr/lib/postgresql/$dirname
# Install Postgres
export PATH=$instdir/bin:/home/jfinzel/bin:/home/jfinzel/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
sudo mkdir $instdir
# This is my directory with source code from git commit 4191e37a9a1fb598267c445c717914012d9bc423
cd ~/postgres_source/postgres
./configure --prefix=$instdir --enable-cassert --enable-debug CFLAGS="-ggdb -g3 -fno-omit-frame-pointer -fPIC"
make
sudo "PATH=$PATH" make install
# Contrib
cd contrib/btree_gist/
sudo "PATH=$PATH" make install
cd ../test_decoding/
sudo "PATH=$PATH" make install
# Install Pglogical
cd /usr/src/pglogical-2.2.1
sudo "PATH=$PATH" make clean
sudo "PATH=$PATH" make install
# Install Extensions
cd $HOME/pgl_ddl_deploy
make clean
sudo "PATH=$PATH" make install
cd $HOME/pglogical_ticker
make clean
sudo "PATH=$PATH" make install
cd $HOME/pg_fact_loader
make clean
sudo "PATH=$PATH" make install
$ ./make_postgres 10.5
Thanks!
Jeremy
Hi,

> - I definitely got the same segfault on a commit after 10.4 - 0bb28ca
> - I am now getting a different segfault on 10.5 - but I need another set
> of eyes to verify I am not compiling it wrong
>
> After decoding successfully for a while, I now get an immediate segfault
> upon peek_changes. First of all, here is the backtrace:

But to confirm, with 10.6 that's gone?

> $ sudo -u postgres gdb -q -c /san/<cluster>/pgdata/core
> /usr/lib/postgresql/10.5/bin/postgres
> Reading symbols from /usr/lib/postgresql/10.5/bin/postgres...done.
> [New LWP 22699]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `postgres: <cluster>: jfinzel foo_db
> 10.7.111.37(52316) FETCH'.
> Program terminated with signal SIGABRT, Aborted.

Note that this is not a segfault, it's an assertion that failed.

> #0  0x00007eff42d54428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> (gdb) bt
> #0  0x00007eff42d54428 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007eff42d5602a in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x0000000000a45f9c in ExceptionalCondition (conditionName=0xc2d688
> "!(prev_first_lsn < cur_txn->first_lsn)", errorType=0xc2d404
> "FailedAssertion", fileName=0xc2d478 "reorderbuffer.c", lineNumber=688) at
> assert.c:54

If it's not gone with 10.6, could you print the two variables in that expression?

> # Install Extensions
> cd $HOME/pgl_ddl_deploy
> make clean
> sudo "PATH=$PATH" make install
> cd $HOME/pglogical_ticker
> make clean
> sudo "PATH=$PATH" make install
> cd $HOME/pg_fact_loader
> make clean
> sudo "PATH=$PATH" make install

It's entirely possible that one of those extensions does something to violate WAL logging rules :/

Greetings,

Andres Freund
On 2019-02-18 12:15:47 -0800, Andres Freund wrote:
> But to confirm, with 10.6 that's gone?

*10.7
> But to confirm, with 10.6 that's gone?
We now have production running on 10.6 (including all those extensions), and have seen no more issues at all. So I am assuming it must be fixed in 10.6 at this point.
> If it's not gone with 10.6, could you print the two variables in that
> expression?
Well it is gone with 10.6 I think, but here is what you requested on my 10.5 version:
(gdb) frame 2
#2 0x0000000000a45f9c in ExceptionalCondition (conditionName=0xc2d688 "!(prev_first_lsn < cur_txn->first_lsn)", errorType=0xc2d404 "FailedAssertion", fileName=0xc2d478 "reorderbuffer.c", lineNumber=688) at assert.c:54
54 abort();
(gdb) up
#3 0x000000000084b0ac in AssertTXNLsnOrder (rb=0x2775790) at reorderbuffer.c:688
688 Assert(prev_first_lsn < cur_txn->first_lsn);
(gdb) print prev_first_lsn
$2 = 9888781386112
(gdb) print cur_txn->first_lsn
$3 = 9888781386112
> # Install Extensions
> cd $HOME/pgl_ddl_deploy
> make clean
> sudo "PATH=$PATH" make install
> cd $HOME/pglogical_ticker
> make clean
> sudo "PATH=$PATH" make install
> cd $HOME/pg_fact_loader
> make clean
> sudo "PATH=$PATH" make install
> It's entirely possible that one of those extensions does something to
> violate WAL logging rules :/
But then I would expect us to still have a problem in our production instance running 10.6, which we don't.
Thanks,
Jeremy
> We now have production running on 10.6 (including all those extensions), and
> have seen no more issues at all. So I am assuming it must be fixed in 10.6 at
> this point.
Somehow I am getting this crash on my compiled 10.6, but I don't get it on packaged 10.6. :(
I've tried this a number of times, so I'm not sure how worthwhile it is to try to get to the bottom of the compile difference. But I would like to understand how the compiled version is different. Really the only difference should be the compile options used for postgres.
Thanks,
Jeremy
On 2019-Feb-18, Jeremy Finzel wrote:
> > We now have production running on 10.6 (including all those extensions),
> > and have seen no more issues at all. So I am assuming it must be fixed in
> > 10.6 at this point.
>
> Somehow I am getting this crash on my compiled 10.6, but I don't get it on
> packaged 10.6. :(

Maybe because the packaged version doesn't have assertions enabled? (configure --enable-cassert)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> Maybe because the packaged version doesn't have assertions enabled?
> (configure --enable-cassert)
Of course! I am getting a failed assertion on 10.6 then. Is that a concern? Should I try 10.7?
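(As an aside, whether a running server was built with assertions can be checked from SQL via the read-only debug_assertions parameter:)

    SHOW debug_assertions;  -- "on" means the build was configured with --enable-cassert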
Thanks,
Jeremy
Hi,

On 2019-02-18 15:48:05 -0600, Jeremy Finzel wrote:
> > Maybe because the packaged version doesn't have assertions enabled?
> > (configure --enable-cassert)
>
> Of course! I am getting a failed assertion on 10.6 then. Is that a
> concern? Should I try 10.7?

Yes, please do. And if that fails, is it realistic to test without those additional extensions? If not, where's their source?

Greetings,

Andres Freund
> Yes, please do. And if that fails, is it realistic to test without those
> additional extensions? If not, where's their source?
So yes, I just reproduced the same assert abort on 10.7, so the issue is still there.
I still need pg_fact_loader, but I was able to strip out any custom C code from the test and I am still able to produce the crash. I dropped extensions pgl_ddl_deploy and pglogical_ticker. pg_fact_loader can optionally use a pg background worker, but I am bypassing that and thus only running plpgsql functions, and still producing the crash. This makes me doubt that it could have anything to do with these extensions.
If interested, here is the source for them:
I want to see if I can reproduce the crash on a fresh cluster. Is there anything else you want me to try at this point?
Thanks,
Jeremy
On Tue, Feb 19, 2019 at 09:32:15AM -0600, Jeremy Finzel wrote:
> I still need pg_fact_loader, but I was able to strip out any custom C code
> from the test and I am still able to produce the crash. I dropped
> extensions pgl_ddl_deploy and pglogical_ticker. pg_fact_loader can
> optionally use a pg background worker, but I am bypassing that and thus
> only running plpgsql functions, and still producing the crash. This makes
> me doubt that it could have anything to do with these extensions.

The best thing which could be done is to post a set of SQL queries on this list which allows the problem to be reproduced.

--
Michael
> The best thing which could be done is to post a set of SQL queries
> on this list which allows the problem to be reproduced.
I am trying, but I am not sure how to do that just yet. However, I can reliably reproduce the error by letting this system run, meaning I have the ability to attach a debugger to see what is happening more deeply. I can also provide just what data in the slot is leading to a failed assert. Would any of this help to narrow down the problem?
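A rough way to show that, sketched here with a hypothetical slot name, is to look at the slot's position and peek at a bounded number of changes:

    -- where decoding will restart from for this slot
    SELECT slot_name, restart_lsn, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE slot_name = 'my_slot';

    -- peek at a limited number of changes without consuming them,
    -- to bracket the record that trips the assertion
    SELECT lsn, xid, left(data, 60) AS data_prefix
    FROM pg_logical_slot_peek_changes('my_slot', NULL, 100);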
Thanks,
Jeremy
On Thu, Feb 14, 2019 at 5:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeremy Finzel <finzelj@gmail.com> writes:
>> On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> So my bet is that your problem was fixed by some other commit between
>>> 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
>>> without more investigation than seems warranted, if the bug's gone.
>
>> I am willing to put in more time debugging this because we want to know
>> which clusters are actually susceptible to the bug. Any suggestions to
>> proceed are welcome.
>
> Well, as Peter said, "git bisect" and trying to reproduce the problem
> at each step would be the way to prove it definitively. Seems mighty
> tedious though. Possibly you could shave some time off the process
> by assuming it must have been one of the commits that touched
> reorderbuffer.c ... a quick check says there have been ten of those
> in the v10 branch since 10.3.
>
> 			regards, tom lane
I am slowly continuing to pursue this, and I hit a slightly different actual segfault on commit 4346794, still in reorderbuffer.c, this time just 2 lines below the original.
The first segfault I was getting was on:

    Assert(specinsert->data.tp.oldtuple == NULL);

Now it's on this:

    change->action = REORDER_BUFFER_CHANGE_INSERT;
I only have 3 more steps to test and I should know exactly what commit fixed the segfault possibility. I still don't know precisely what can reproduce the crash.
Thanks,
Jeremy
On Wed, Feb 27, 2019 at 8:34 AM Jeremy Finzel <finzelj@gmail.com> wrote:
> On Thu, Feb 14, 2019 at 5:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jeremy Finzel <finzelj@gmail.com> writes:
>>> On Thu, Feb 14, 2019 at 4:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> So my bet is that your problem was fixed by some other commit between
>>>> 10.3 and 10.6. Maybe the predecessor one, b767b3f2e; but hard to say
>>>> without more investigation than seems warranted, if the bug's gone.
>>
>>> I am willing to put in more time debugging this because we want to know
>>> which clusters are actually susceptible to the bug. Any suggestions to
>>> proceed are welcome.
>>
>> Well, as Peter said, "git bisect" and trying to reproduce the problem
>> at each step would be the way to prove it definitively. Seems mighty
>> tedious though. Possibly you could shave some time off the process
>> by assuming it must have been one of the commits that touched
>> reorderbuffer.c ... a quick check says there have been ten of those
>> in the v10 branch since 10.3.
>>
>> 			regards, tom lane
I have confirmed which commit fixed the segfault:
cee1dd1eeda1e7b86b78f240d24bbfde21d75928 is the first fixed commit
commit cee1dd1eeda1e7b86b78f240d24bbfde21d75928
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Tue Mar 6 15:57:20 2018 -0300
Refrain from duplicating data in reorderbuffers
If a walsender exits leaving data in reorderbuffers, the next walsender
that tries to decode the same transaction would append its decoded data
in the same spill files without truncating it first, which effectively
duplicate the data. Avoid that by removing any leftover reorderbuffer
spill files when a walsender starts.
Backpatch to 9.4; this bug has been there from the very beginning of
logical decoding.
Author: Craig Ringer, revised by me
Reviewed by: Álvaro Herrera, Petr Jelínek, Masahiko Sawada
I'm very curious whether anyone more familiar with the code base understands how the old code was vulnerable to a segfault but the new code is not. I don't think a walsender was exiting at all in my situation, so I suspect this code change accidentally fixed a potential segfault unrelated to an exiting walsender leaving data in reorderbuffers.
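For anyone checking their own systems: the leftover spill data that commit talks about, if present, lives in the slot's directory under pg_replslot/ in the data directory. A rough way to eyeball it from SQL (superuser only; slot name hypothetical) is:

    -- reorderbuffer spill files for serialized transactions, if any are
    -- left over, would show up here alongside the slot's state file
    SELECT pg_ls_dir('pg_replslot/my_slot');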
I still can't explain precisely what reproduces it. As I have said before, some unique things about my setup are lots of INSERT ... ON CONFLICT UPDATE queries, as well as btree_gist exclusion indexes, but I suspect this has to do with the former somehow, although I haven't been able to build a reliable reproducer for the crash.
I suggest at least adding something to the 10.4 release notes (and the corresponding minor releases of the other major versions) around this.
Here is the release note now:
- In logical decoding, avoid possible double processing of WAL data when a walsender restarts (Craig Ringer)
Perhaps some addition like this:
- In logical decoding, avoid possible double processing of WAL data when a walsender restarts (Craig Ringer)
This change also had the effect of fixing a potential segfault in the logical decoding peek/get_changes API (in ReorderBufferCommit).
Thanks,
Jeremy