Thread: BUG #6222: Segmentation fault on unlogged table
The following bug has been logged online: Bug reference: 6222 Logged by: Fujii Masao Email address: masao.fujii@gmail.com PostgreSQL version: 9.2dev Operating system: Linux hermes 2.6.38-11-generic #50-Ubuntu SMP Mon Sep 12 21:18:14 UTC 2011 i686 i686 i386 GNU/Linux Description: Segmentation fault on unlogged table Details: When I made pgbench tables unlogged and ran pgbench, I encountered a segmentation fault. Here is the test case which reproduces a segmentation fault: $ pgbench -i $ pg_dump > pgbench.dump $ sed s/"CREATE TABLE"/"CREATE UNLOGGED TABLE"/g pgbench.dump > pgbench_unlogged.dump $ createdb test $ psql -f pgbench_unlogged.dump test $ pgbench -c2 -T60 -Mprepared LOG: server process (PID 23154) was terminated by signal 11: Segmentation fault LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. <snip> The backtrace from core file shows that SEGV comes from ExecQual(): $ gdb -c data/core bin/postgres <snip> Core was generated by `postgres: postgres test [local] UPDATE '. Program terminated with signal 11, Segmentation fault. #0 0x0821a6fa in ExecQual () (gdb) bt #0 0x0821a6fa in ExecQual () #1 0x0821aff1 in ExecScan () #2 0x0823035e in ExecSeqScan () #3 0x082113fe in ExecProcNode () #4 0x0822e1f0 in ExecModifyTable () #5 0x082113b2 in ExecProcNode () #6 0x0820f273 in ExecutePlan () #7 0x0820d93b in standard_ExecutorRun () #8 0x0820d7af in ExecutorRun () #9 0x0832886f in ProcessQuery () #10 0x08329e7c in PortalRunMulti () #11 0x083295b1 in PortalRun () #12 0x083251a8 in exec_execute_message () #13 0x08327c67 in PostgresMain () #14 0x082d0fb2 in BackendRun () #15 0x082d05c0 in BackendStartup () #16 0x082cd5c8 in ServerLoop () #17 0x082ccf48 in PostmasterMain () #18 0x0824d748 in main () BTW, when I did the same test on MacOS, I got a bus error instead of SEGV.
On Sun, Sep 25, 2011 at 9:16 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > When I made pgbench tables unlogged and ran pgbench, I encountered a > segmentation fault. Here is the test case which reproduces a segmentation > fault: > > $ pgbench -i > $ pg_dump > pgbench.dump > $ sed s/"CREATE TABLE"/"CREATE UNLOGGED TABLE"/g pgbench.dump > > pgbench_unlogged.dump > $ createdb test > $ psql -f pgbench_unlogged.dump test > $ pgbench -c2 -T60 -Mprepared The first time I did this (on MacOS X), I got a bus error, just as you did. But I followed the exact series of commands above, which I believe would end up with the test running against the default database (with regular tables), rather than the test database with unlogged tables. The second time, it worked OK against the regular tables, so I then tried it again against the unlogged tables (by running the last command in the series again with "test" added to the end) and it then crashed with the following backtrace: #0 0x0000000101153560 in ?? () #1 0x00000001001540fe in ExecScan (node=0x7fff5fbfde4f, accessMtd=0x100167150 <SeqNext>, recheckMtd=0x100166ef0 <SeqRecheck>) at execScan.c:195 #2 0x000000010014bcb2 in ExecProcNode (node=0x101151f30) at execProcnode.c:393 #3 0x0000000100164691 in ExecModifyTable (node=0x101151c20) at nodeModifyTable.c:766 #4 0x000000010014bcf2 in ExecProcNode (node=0x101151c20) at execProcnode.c:370 #5 0x000000010014a10e in ExecutePlan [inlined] () at /Users/rhaas/pgsql/src/backend/executor/execMain.c:1438 #6 0x000000010014a10e in standard_ExecutorRun (queryDesc=0x10113f388, direction=ForwardScanDirection, count=0) at execMain.c:312 #7 0x000000010023f714 in ProcessQuery (plan=0x101140e98, sourceText=0x10113f238 "UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2;", params=<value temporarily unavailable, due to optimizations>, dest=0x100520420, completionTag=0x7fff5fbfe450 "") at pquery.c:185 #8 0x000000010023fe77 in PortalRunMulti (portal=0x10113d238, isTopLevel=<value temporarily unavailable, due to optimizations>, dest=0x100520420, altdest=0x100520420, completionTag=0x7fff5fbfe450 "") at pquery.c:1274 #9 0x0000000100240888 in PortalRun (portal=0x10113d238, count=9223372036854775807, isTopLevel=<value temporarily unavailable, due to optimizations>, dest=0x101055a50, altdest=0x101055a50, completionTag=0x7fff5fbfe450 "") at pquery.c:8& #11 0x00000001001ee7ec in ServerLoop () at postmaster.c:3512 #12 0x00000001001ef6d7 in PostmasterMain (argc=1, argv=0x100c08c40) at postmaster.c:1093 #13 0x00000001001824b5 in main (argc=1, argv=0x100c08c40) at main.c:199 Unfortunately, this is thoroughly unhelpful, because the top of the stack has apparently been clobbered, and I can't see where inside ExecQual the crash is happening. (gdb) fr 1 #1 0x00000001001540fe in ExecScan (node=0x7fff5fbfde4f, accessMtd=0x100167150 <SeqNext>, recheckMtd=0x100166ef0 <SeqRecheck>) at execScan.c:195 195 if (!qual || ExecQual(qual, econtext, false)) The whole thing is a bit mysterious because ExecQual() doesn't really do much that seems like it could generate an invalid memory reference. I'll poke at this some more... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Sep 26, 2011 at 11:00 AM, Robert Haas <robertmhaas@gmail.com> wrote: > The whole thing is a bit mysterious because ExecQual() doesn't really > do much that seems like it could generate an invalid memory reference. > > I'll poke at this some more... I added some debugging code which sets a global variable to various values as it executes this code. I was still able to reproduce the crash (but only with unlogged tables, as in your original report) and the crash appears to be happening here: + where_did_we_crash = 104; expr_value = ExecEvalExpr(clause, econtext, &isNull, NULL); + where_did_we_crash = 105; I end up with where_did_we_crash = 104 in the core dump. ExecEvalExpr is a macro which does this: #define ExecEvalExpr(expr, econtext, isNull, isDone) \ ((*(expr)->evalfunc) (expr, econtext, isNull, isDone)) I can't print out the value of "clause" directly, because the invocation of ExecQual() doesn't even show up in the stack trace. But I can see from the backtrace that it's getting called by ExecScan() with an argument of qual. That qual is a one-element list, and it's only element is also a List. That List contains a FuncExprState node with an evalfunc of ExecEvalOper. But unless I'm missing something, that's no good, because ExecQual is only walking the outer list, not the inner one. And certainly if it tries to use a List object as an ExprState, that's not going to work. To check my work, I did this: --- a/src/backend/executor/execQual.c +++ b/src/backend/executor/execQual.c @@ -5003,6 +5003,7 @@ ExecQual(List *qual, ExprContext *econtext, bool resultForNull) Datum expr_value; bool isNull; + Assert(!IsA(clause, List)); expr_value = ExecEvalExpr(clause, econtext, &isNull, NULL); if (isNull) And in fact the test case (when run against the unlogged tables) fails that assertion: TRAP: FailedAssertion("!(!((((Node*)(clause))->type) == T_List))", File: "execQual.c", Line: 5006) Now I'm not too sure why that is happening yet, but I'm leaning toward the idea that the bug here is timing-related and that unlogged tables aren't the cause, but rather just make it easier to hit whatever the race condition is by removing some overhead. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Sep 26, 2011 at 11:40 AM, Robert Haas <robertmhaas@gmail.com> wrote: > To check my work, I did this: > > --- a/src/backend/executor/execQual.c > +++ b/src/backend/executor/execQual.c > @@ -5003,6 +5003,7 @@ ExecQual(List *qual, ExprContext *econtext, bool > resultForNull) > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Datum =A0 =A0 =A0 =A0 =A0 expr_value; > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0bool =A0 =A0 =A0 =A0 =A0 =A0isNull; > > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 Assert(!IsA(clause, List)); > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0expr_value =3D ExecEvalExpr(clause, econte= xt, &isNull, NULL); > > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0if (isNull) > > And in fact the test case (when run against the unlogged tables) fails > that assertion: > > TRAP: FailedAssertion("!(!((((Node*)(clause))->type) =3D=3D T_List))", > File: "execQual.c", Line: 5006) > > Now I'm not too sure why that is happening yet, but I'm leaning toward > the idea that the bug here is timing-related and that unlogged tables > aren't the cause, but rather just make it easier to hit whatever the > race condition is by removing some overhead. I cannot reproduce this on commit e6faf910d75027bdce7cd0f2033db4e912592bcc. But on the very next commit I can: commit e6faf910d75027bdce7cd0f2033db4e912592bcc Author: Tom Lane <tgl@sss.pgh.pa.us> Date: Fri Sep 16 00:42:53 2011 -0400 Redesign the plancache mechanism for more flexibility and efficiency. Tom, any thoughts? --=20 Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Sep 26, 2011 at 11:08 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Sep 26, 2011 at 11:40 AM, Robert Haas <robertmhaas@gmail.com> wro= te: >> To check my work, I did this: >> >> --- a/src/backend/executor/execQual.c >> +++ b/src/backend/executor/execQual.c >> @@ -5003,6 +5003,7 @@ ExecQual(List *qual, ExprContext *econtext, bool >> resultForNull) >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0Datum =A0 =A0 =A0 =A0 =A0 expr_value; >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0bool =A0 =A0 =A0 =A0 =A0 =A0isNull; >> >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 Assert(!IsA(clause, List)); >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0expr_value =3D ExecEvalExpr(clause, econt= ext, &isNull, NULL); >> >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0if (isNull) >> >> And in fact the test case (when run against the unlogged tables) fails >> that assertion: >> >> TRAP: FailedAssertion("!(!((((Node*)(clause))->type) =3D=3D T_List))", >> File: "execQual.c", Line: 5006) >> >> Now I'm not too sure why that is happening yet, but I'm leaning toward >> the idea that the bug here is timing-related and that unlogged tables >> aren't the cause, but rather just make it easier to hit whatever the >> race condition is by removing some overhead. > > I cannot reproduce this on commit > e6faf910d75027bdce7cd0f2033db4e912592bcc. =A0But on the very next commit > I can: > > commit e6faf910d75027bdce7cd0f2033db4e912592bcc > Author: Tom Lane <tgl@sss.pgh.pa.us> > Date: =A0 Fri Sep 16 00:42:53 2011 -0400 > > =A0 =A0Redesign the plancache mechanism for more flexibility and efficien= cy. > > Tom, any thoughts? hm. any relation to YAMAMOTO Takashi's recent email, [BUGS] bug in plancache.c, which hasn't hit the archives yet? > "GetCachedPlan can pass the 'qlist' to the planner twice. if i understand the code correctly, it's unsafe because the planner is destructive wrt the input tree. for my application, it often causes a crash in executor." merlin
On Mon, Sep 26, 2011 at 12:17 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > hm. any relation to YAMAMOTO Takashi's recent email, [BUGS] bug in > plancache.c, which hasn't hit the archives yet? > >> "GetCachedPlan can pass the 'qlist' to the planner twice. > if i understand the code correctly, it's unsafe because the planner is > destructive wrt the input tree. =A0for my application, it often causes > a crash in executor." I was just wondering about that. It seems like it could very well be the same issue, but I have not tested it yet. --=20 Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Sep 26, 2011 at 12:20 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Sep 26, 2011 at 12:17 PM, Merlin Moncure <mmoncure@gmail.com> wro= te: >> hm. any relation to YAMAMOTO Takashi's recent email, [BUGS] bug in >> plancache.c, which hasn't hit the archives yet? >> >>> "GetCachedPlan can pass the 'qlist' to the planner twice. >> if i understand the code correctly, it's unsafe because the planner is >> destructive wrt the input tree. =A0for my application, it often causes >> a crash in executor." > > I was just wondering about that. =A0It seems like it could very well be > the same issue, but I have not tested it yet. OK, I tested it. On my system, that patch appears to fix this problem. --=20 Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: >> I was just wondering about that. It seems like it could very well be >> the same issue, but I have not tested it yet. > OK, I tested it. On my system, that patch appears to fix this problem. Yeah, I was just going to suggest that that might be related, but I see you beat me to the deduction. Will commit it (with more than zero comments) in a moment. regards, tom lane